AD-A173 022 


Productivity Engineering in the UNIX† Environment


The Design and Evaluation of a High Performance Smalltalk System 


Technical Report 


S. L. Graham 
Principal Investigator 

(415) 642-2050 






“The views and conclusions contained in this document are those of the authors and 
should not be interpreted as representing the official policies, either expressed or implied, 
of the Defense Advanced Research Projects Agency or the U.S. Government." 



Contract No. N00039-84-C-0080 
August 7, 1984 - August 6, 1987


Arpa Order No. 4871 



†UNIX is a trademark of AT&T Bell Laboratories


CLEARED
FOR OPEN PUBLICATION

SEP 23 1986

DIRECTORATE FOR FREEDOM OF INFORMATION
AND SECURITY REVIEW (OASD-PA)
DEPARTMENT OF DEFENSE


This document has been approved
for public release and sale; its
distribution is unlimited.



The Design and Evaluation of 
A High Performance 
Smalltalk System 


David Michael Ungar 
February, 1986 






Abstract 


The Smalltalk-80™ system makes it possible to write programs quickly by providing
object-oriented programming, incremental compilation, run-time type checking, user-extensible data
types and control structures, and an interactive graphical interface. However, the potential savings 
in programming effort have been curtailed by poor performance in widely available computers or 
high processor cost. Smalltalk-80 systems pose tough challenges for implementors: dynamic data 
typing, a high-level instruction set, frequent and expensive procedure calls, and object-oriented 
storage management. 


The dissertation documents two results that run counter to conventional wisdom: that a 
reduced instruction set computer can offer excellent performance for a system with dynamic data 
typing such as Smalltalk-80, and that automatic storage reclamation need not be time-consuming. 


This project was sponsored by Defense Advanced Research Projects Agency (DoD) ARPA
Order No. 3803, monitored by Naval Electronic System Command under Contract No.
N00034-K-0251. It was also sponsored by Defense Advanced Research Projects Agency (DoD)
ARPA Order No. 4871, monitored by Naval Electronic Systems Command under Contract No.
N00039-84-C-0089.








The Design and Evaluation
of a High Performance Smalltalk System

By

David Michael Ungar

B.S. in Electrical Engineering (Washington University, Missouri) 1976 
B.S. in Applied Math & Computer Science (Washington University, Missouri) 1976
M.S. (Washington University, Missouri) 1977 

DISSERTATION 

Submitted in partial satisfaction of the requirements for the degree of 

DOCTOR OF PHILOSOPHY 
in 

Computer Science 
in the 

GRADUATE DIVISION 
OF THE 

UNIVERSITY OF CALIFORNIA, BERKELEY 


Approved:

Chairman

Date













The Design and Evaluation 
of a High Performance Smalltalk System 


Copyright © 1986 


David Michael Ungar 





The Design and Evaluation of a High Performance Smalltalk System 


Ph.D. 


David Michael Ungar Computer Science 


Sponsors: Defense Advanced Research Projects Agency 
International Business Machines Corporation




Abstract 

The Smalltalk-80™ system makes it possible to write programs quickly by providing
object-oriented programming, incremental compilation, run-time type checking,
user-extensible data types and control structures, and an interactive graphical interface.
However, the potential savings in programming effort have been curtailed by poor
performance in widely available computers or high processor cost. Smalltalk-80 systems pose
tough challenges for implementors: dynamic data typing, a high-level instruction set,
frequent and expensive procedure calls, and object-oriented storage management.

To solve these problems, a group of researchers at U. C. Berkeley has designed and
built the SOAR (Smalltalk On A RISC) microprocessor. In order to determine the
performance of Smalltalk-80 on SOAR and to evaluate the importance of each of the ideas,
simulations of five representative benchmarks have been analyzed. The results suggest that:

• Six ideas substantially improve performance: compilation to a low-level instruction
set, multiple windows of on-chip registers, caching the target of a call instruction in the
instruction itself, byte insert and extract instructions, instructions for arithmetic and
comparison operations on tagged integers, and our storage management algorithm,
Generation Scavenging.




















• Seven features contribute little to performance: shadow registers to simplify trap
recovery, hardware assistance for garbage collection, vectored traps, addressable
registers, clearing multiple registers in parallel, conditional trap instructions, and load- and
store-multiple instructions.

• The language-specific hardware in SOAR doubles its performance over a RISC II with
the same cycle time.

• Generation Scavenging, a storage reclamation algorithm developed by the author,
consumes only 3% of the CPU time, in contrast to the 9% of comparable Smalltalk-80
systems.

• Despite a five-to-one handicap in basic cycle time, the NMOS SOAR microprocessor
should run as fast as an ECL Dorado minicomputer.

The dissertation reports two results that run counter to conventional wisdom: that a 
reduced instruction set computer can offer excellent performance for a system with dynamic 
data typing such as Smalltalk-80, and that automatic storage reclamation need not be 
time-consuming.

Chapter 1


Introduction 


Moons and Junes and Ferris wheels
the dizzy dancing way you feel.

As every fairy tale comes real
I've looked at SOAR that way...

I’ve looked at SOAR from both sides now, 
from win and lose, and still somehow 
It’s SOAR’s solutions I recall. 

I really don't know SOAR, at all. 

“Both Sides Now’’, 

(with apologies to) Joni Mitchell 


Computer hardware technology has improved dramatically in the past decade.
Computers now cost less, run faster, and have more space for programs and data. This advance in
hardware has created a demand for larger and more complex software. Unfortunately,
software productivity has not kept pace with hardware technology, leading to a “software
crisis.”

The Smalltalk-80 system provides an environment that fosters rapid program
development. The system itself was developed on a large, high-speed, $100,000 personal computer,
and most commercially available microprocessors, which are much more widely available,
cannot run it even half as fast. Regrettably, this lack of widely available high-performance
implementations has severely curtailed the system’s acceptance.

It may be possible to surmount this obstacle with a reduced instruction set computer
(RISC) architecture. Such processors have demonstrated excellent cost-performance for
more conventional systems. However, RISCs have an architectural style that runs counter to
the conventional wisdom for exploratory programming environments, such as Smalltalk-80.
Instead of an instruction set that reflects the semantics of the source language, a RISC
instruction set reflects the demands of fast instruction decoding and execution.

We have investigated whether a reduced instruction set computer can provide good
performance for the Smalltalk-80 system. To this end we have analyzed the architecture of
and designed and analyzed the software algorithms for a reduced instruction set
microcomputer system intended to run the Smalltalk-80 exploratory programming environment at full
speed. This system matches the performance of the fastest Smalltalk-80 implementations to
date (1986), yet runs at slower clock and memory speeds. The machine is called SOAR, for
Smalltalk On A RISC. Our colleagues have built two VLSI implementations of SOAR: an
NMOS chip (Figure 1.1) which has correctly run diagnostics, and a CMOS chip. In
addition, two Multibus™-compatible boards have been designed by others to host our chip in a
Sun 68010 workstation [BlD83, Bro84]. Our ultimate goal is to demonstrate SOAR in a
running Smalltalk-80 system.

We have also built Berkeley Smalltalk (BS) [UnP83], a Smalltalk interpreter for the
MC68010 that runs on the Sun workstation. It has served as a test bed for many of our ideas
and as a source of information about the time-consuming operations required to support the
Smalltalk-80 system.

SOAR is a concoction of compiler technology, run-time software, architecture, and 
VLSI circuit design. This dissertation focuses on SOAR’s architecture and run-time support 
software: what SOAR is, how it was designed, and why it works. 

• The next chapter describes the previous work in this area. It starts with a brief
description of some exploratory programming environments (EPEs), with particular emphasis
on the Smalltalk-80 EPE. It continues with a survey of architectures that supported
EPEs. Until SOAR, these systems pushed the source-level semantics into the
hardware, sacrificing either simplicity or performance. The last part of this chapter
covers previous reduced instruction set computers, which were all designed for
languages in the Algol family. SOAR is the first reduced instruction set architecture
for an exploratory programming environment.

• Chapter 3 enumerates the problems that Smalltalk-80 presents and the solutions in
SOAR’s architecture. The effectiveness of each solution is represented by the time
cost of its omission, based on data gathered from simulations. Table 1.1 summarizes
these results.

• Chapter 4 casts a critical eye on SOAR’s architecture. Simulation results show that a
400 ns SOAR will match the performance of a 70 ns ECL minicomputer. It will also
run at about the same speed as an MC68020 microprocessor with a 60 ns clock, 270 ns
memory, an on-chip instruction cache, and eight times more transistors than SOAR.
To understand SOAR’s speed, its architectural features are listed in order of
effectiveness, from successes to failures. These results show that SOAR's language-specific
features approximately double performance.

• Chapter 5 delves into object-oriented storage management, a considerable source of
overhead and complexity for many Smalltalk-80 systems. For SOAR, we have devised
Generation Scavenging, a software algorithm that cuts automatic storage reclamation
overhead from 11% to 3%, reclaims circular structures, and provides an additional
20% performance improvement by eliminating a level of indirection. In addition to
virtually eliminating the time cost of garbage collection, this algorithm allows us to
remove object-oriented addressing from the architecture.

Table 1.1: Smalltalk-80 performance challenges and SOAR features.

Smalltalk-80 performance challenge:
    SOAR feature                         significance

Type Checking:
    tagged integers                      26%
    two-tone instructions                16%
Interpretation:
    compiling to RISC instructions       -100%
    byte insert/extract instructions     33%
Procedure Calls:
    register windows                     46%
    in-line cache                        33%
    fast shuffle                         11%
Object Oriented Storage Management:
    direct pointers                      20%
    generation scavenging                10%

• Chapter 6 furnishes some proposals for coping with medium lifetime objects and an
analytical investigation of them.

• Finally, the concluding chapter presents the lessons we have learned from SOAR and
our recommendations for future designs.

• The appendices supplement the performance evaluation of SOAR’s architecture:
Appendix A contains a detailed analysis of each feature’s impact on speed and
memory size, and Appendix B gives our raw performance data.

The Cedar programming environment was also designed to enhance programming
productivity, but has taken a different tack from Smalltalk and Interlisp
[DeT80, Tei84, Tei83, SZH85, Rov84]. Smalltalk and Interlisp minimize the length of
programs and reduce the time to change and test them. This reduction in information from the
programmer, coupled with the elimination of a link-editing or binding phase, places many
demands on the execution of the program, which leads to the issues we address in this
dissertation. In contrast, the Cedar system relies on a strongly-typed language which makes
data types and module interfaces explicit. These features enhance the comprehensibility and
maintainability of large systems and allow the compiler to generate more efficient code. It
would seem that of the ideas presented herein, only the storage management algorithms
would be important with respect to an implementation of Cedar.

This research centers on one EPE in particular, the Smalltalk-80 system. Although
other EPEs share some of its features, we will henceforth concentrate on Smalltalk. Over a
decade ago, a small band of adventurers at Xerox PARC set out to explore how
computational resources could help people master the programming process. The Smalltalk-80
system [GoR83, Gol81, Gol84, Kra83] is their latest achievement. We have taken a simple
architecture and added a few features, resulting in a simple machine whose improved
cost-performance could make the Smalltalk-80 system available to many more people.


2.1.1. Object-Oriented Programming 

The Smalltalk systems introduced object-oriented programming, which provides 
abstractions for structuring programs and reduces the code that must be written. 
Object-oriented programming in Smalltalk-80 has three important aspects: 


• First, there are no type declarations in Smalltalk-80. Instead, information is kept at run
time to resolve a variable's type. A variable may take on many different types.










• Second, a Smalltalk-80 procedure call uses the type of the first argument to choose its
target routine. The first parameter of every subroutine has an associated type, and the
subroutines are grouped accordingly. When a Smalltalk-80 system performs a call, it
finds the routine associated with the type of the call's first argument. As mentioned
above, the type is not known in advance, so this search must occur at runtime. This
overloaded call also makes it easier to reuse an old routine with a new type. When the
old routine uses the new type, operations defined on that type will be chosen at
run-time. It is not even necessary to recompile the old routine. In other words, new
types can be added gracefully to the system.

• Finally, types can be defined as extensions of other types. To define a new type that is
similar to an old one, the programmer can give the differences, and the new type will
inherit the format and functions from the old one.
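
The three aspects above can be illustrated with a small sketch. It is not taken from any Smalltalk-80 implementation: the names `SmalltalkClass`, `Obj`, `send`, and `report`, and the `Point`/`Point3D` types, are hypothetical and chosen only for illustration. The sketch assumes that each type holds a table of routines and a link to the type it extends, and that a call searches those tables at run time using the type of its first argument.

```python
class SmalltalkClass:
    """Hypothetical run-time type: a name, an optional parent, and its own routines."""

    def __init__(self, name, parent=None, methods=None):
        self.name = name
        self.parent = parent
        self.methods = dict(methods or {})

    def lookup(self, selector):
        """Search this type, then each ancestor, for a routine named `selector`."""
        cls = self
        while cls is not None:
            if selector in cls.methods:
                return cls.methods[selector]
            cls = cls.parent
        raise AttributeError(f"{self.name} does not understand {selector}")


class Obj:
    """Hypothetical object: its type is recorded at run time, not declared."""

    def __init__(self, cls, **fields):
        self.cls = cls
        self.fields = fields


def send(receiver, selector, *args):
    """A call: the type of the first argument (the receiver) selects the target."""
    method = receiver.cls.lookup(selector)
    return method(receiver, *args)


point = SmalltalkClass("Point", methods={
    "describe": lambda self: f"{self.cls.name} at ({self.fields['x']}, {self.fields['y']})",
    "norm1": lambda self: abs(self.fields["x"]) + abs(self.fields["y"]),
})

# A new type defined by its differences from Point; "describe" is inherited.
point3d = SmalltalkClass("Point3D", parent=point, methods={
    "norm1": lambda self: sum(abs(self.fields[k]) for k in ("x", "y", "z")),
})


def report(obj):
    """An 'old' routine written before Point3D existed; it needs no recompilation."""
    return f"{send(obj, 'describe')} has norm {send(obj, 'norm1')}"
```

Calling `report(Obj(point3d, x=1, y=-2, z=3))` reuses the old routine unchanged, while the `norm1` operation defined on the new type is the one chosen at run time.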

The Smalltalk-80 implementation has two more features that help its programmers.
For one thing, it runs on a computer dedicated to one user. Freedom from competing
demands lets the system provide uniform, fast response time in order to enhance
productivity. The other feature is automatic storage reclamation. Programmers of early
list-manipulation systems found it cumbersome to free unused storage explicitly. Instead,
they found ways to let the run-time support software reclaim unused storage automatically
[McC60, Col60]. Automatic reclamation provided a very important benefit: eliminating
errors caused by releasing storage too early. Despite its advantages, the high overhead
associated with automatic storage reclamation prevented widespread acceptance. This barrier
has been removed by faster algorithms.

2.1.2. Shortening the Edit-Compile-Test-Debug Cycle

In addition to reducing editing time, the Smalltalk-80 system reduces the time for the
compile, test, and debug phases of software construction. Conventional systems require a
lot of time to rebuild a large program after a change. The Smalltalk-80 system uses
incremental compilation and dynamic linking to integrate changes rapidly.

• Incremental compilation. To reduce the work needed to incorporate a small textual
change, a system must avoid recompiling the whole program. Information in symbol
tables or parse trees must be maintained and reused for the portion that did not change.
Most systems supply separate compilation on a module-by-module basis.
Recompilation frequently takes ten seconds to a minute. The Smalltalk-80 system provides a
much finer grain of incremental compilation and much shorter response times. Magpie
is a similar EPE for PASCAL [DMS84]. It compiles after every keystroke. In this
system, there is rarely a perceptible delay to rebuild a program.

• Dynamic linking. In a system that does all linking before execution starts, the
programmer must wait a while longer after recompiling a module while the system relinks
the module to the program’s other modules. The result is that a simple change to a
large program takes a long time. In systems like Smalltalk-80, modules are not
statically bound together. Instead, they are connected as needed, dynamically. Dynamic
linking is essential to maintain short response time for changing large programs.

• Source-level debugging. Although most programmers construct their programs in a
high-level language, early systems forced them to debug their programs in terms of
machine instructions and machine data types. Modern systems make debugging easier
by presenting breakpoints, errors, and variables in terms of the HLL source code
instead of the object code. For instance, they show where execution is suspended in
the source code and can execute a line at a time. In such systems, the programmer can
debug much faster because he has less work to do. EPEs go even further. When
debugging, the programmer can try the effect of a new statement by merely typing it
in. The Smalltalk-80 system will instantly compile and execute the statement in the
context of the suspended program. When the error is located, it can be corrected
without terminating the suspended program. It can be restarted, or single-stepped from
the point of the error. With a system like Smalltalk-80, one can debug a program into
existence.

The Smalltalk-80 system represents a compromise between compiled and interpreted
systems. Programmers can produce more software when they can incorporate and test
changes faster and when they can take advantage of a powerful debugger. Most such
systems are interpreters, saving much state and interpreting it at runtime. Of course, the extra
work involved imposes severe performance penalties. To run the fastest, a program must do
the least work; compilers attempt to determine as much as possible about a program's
behavior statically, leaving a minimum of work for runtime. The Smalltalk-80 system is a
happy medium. Enough information is compiled out to make good performance possible,
but enough is left in to make it easier to program.

2.1.3. Graphics

The Smalltalk-80 system takes advantage of bitmap display hardware and pointing 
devices to support multiple windows, selecting by pointing, pop-up menus, even diagrams of 
program structure [ShM83]. This follows the adage that "A picture is worth a thousand 
words.” 

2.1.4. Rapid Response 

High productivity demands consistent, split-second response time [Tha81]. So, most 
EPEs we know of use dedicated personal, high-performance minicomputers. 

2.1.5. The Bad News 

Why do exploratory computing environments remain largely experimental? They
suffer from poor cost-performance. For example, each of the EPEs in Table 2.1 requires a
powerful and costly minicomputer for each programmer. The research in this dissertation is
an attempt to reduce the hardware cost for the Smalltalk-80 exploratory programming
environment.


2.2. The Smalltalk-80 Exploratory Programming Environment

In 1972 Alan Kay started a group at Xerox PARC to explore how computational
resources could help people master the programming process. The Smalltalk-80 system
[GoR83, Gol81, Gol84, Kra83] is the culmination of their efforts. A dedicated, powerful
personal computer hosts this innovative system. Multiple on-screen windows, pop-up menus,
and pointing distinguish Smalltalk-80’s user interface from older systems. The Smalltalk-80
language has replaced operating on variables with sending messages to objects, and its
run-time system automatically reclaims storage and finds space to allocate new objects.

Smalltalk-80’s greatest strengths and its worst weaknesses result from the same design
decision, dynamic binding of types to variables and subroutines to call instructions.
Smalltalk-80's designers have eliminated type declarations from the language, thereby
making it easier to write and modify programs.

On the other hand, computing a variable’s type or a call's destination on-the-fly slows
down the system, or increases the cost for a machine with adequate performance. The only
computer that has demonstrated universally acceptable Smalltalk-80 performance is the
Xerox Dorado [LPM81, Pie83, Deu83a]. This 70 ns ECL minicomputer costs $120,000 (in
1983) and dissipates over 2 kilowatts, requiring an air-conditioned room. Smalltalk-80
systems that run on more conventional, cheaper computers, including our own Berkeley
Smalltalk, suffer lackluster performance. For example, Table 2.2 shows the performance of
the official Smalltalk-80 compiler benchmark for several implementations, including a
simulation of our machine. (See Section 4.1 for a description of the benchmarks.)


Table 2.1: Some exploratory programming environments.

Environment     Language        Developed at    Host CPU

InterLisp-D     InterLisp       Xerox PARC      Dorado
Cedar           Cedar-Mesa      Xerox PARC      Dorado
Smalltalk-80    Smalltalk-80    Xerox PARC      Dorado
Lisp Machine    ZetaLisp        Symbolics       Symbolics 3600


2.3. Reducing the Cost of EPEs with Software Only

How can we make Exploratory Programming Environments more cost effective and
more generally available? One way is with clever software on a cheap, conventional
machine. L. Peter Deutsch and Alan Schiffman have built such a Smalltalk-80 system for a
10 MHz Motorola 68010 microprocessor [DeS84], a conventional (and successful) general
purpose microprocessor. The 68010’s microcoded control unit implements a 32-bit,
register-based instruction set that runs at memory speed. Jumps pay a penalty to refill the
instruction pipeline, and calls must contend with register saving and restoring overhead. A
large flat address space helps support systems like Smalltalk and Lisp that require large,
single address spaces.


Although the fastest 68010 instruction is 6 times slower than a Dorado microinstruction,
the Deutsch-Schiffman system runs Smalltalk-80 only three times slower.*

* The system has now been ported to the MC68020, in a SUN 3 workstation. This processor runs at 16.67 MHz, with
wait states (SSSSS). The fastest possible instruction read is three clock cycles, or 180 ns. The memory system can deliver a
32-bit word in 270 ns. So, the cycle time for a simple instruction would seem to range from 180 ns to 270 ns, depending on
whether the instruction is cached. On this machine, the Xerox 68000 Smalltalk system can execute the compiler benchmark
I0« as fast as a Dorado.


Table 2.2: Performance of Smalltalk-80 Compiler Benchmark.

Machine                 Dorado     Dolphin    VAX-11/780    68010      SOAR
                        (Xerox)    (Xerox)    (DEC)         (Xerox)    (UCB)

Year of introduction    1978       1978       1978          1984       1985
Technology              ECL        TTL        TTL           NMOS       NMOS
Cycle time              67 ns      180 ns     200 ns        400 ns     400 ns
Relative performance    (100%)     11%        8%            40%        103%
(Dorado = 100%, larger is faster)

Virtual machine implementation: microcode, assembler
Object pointer size: 16 bits, 32 bits


The efficiency improvement over the Dorado arises from the following software techniques:

• Dynamic translation. Instead of being interpreted, Smalltalk-80 subroutines are
translated into 68010 instructions when first called. The translated versions are
directly executed and then cached for later use.

• In-line caching. Each procedure call requires a table lookup to find its target
subroutine. Even though a call could invoke many possible targets, there is a simple way to
predict the target of any given call. 95% of the time, a call will invoke the same
routine it did the last time [DAmb83]. Thus, after performing a lookup for a call
instruction, the Deutsch-Schiffman system overwrites the call to the lookup routine with a
call to the target routine. The next time the call is executed, control bypasses the
lookup routine and goes directly to the previous target. Of course, the other 5% of the
time, the target has changed. So, each subroutine starts with a check to cause another
lookup if necessary. In this manner, the targets for subroutine calls are cached in the
instruction stream, eliminating costly lookups.

• Volatile contexts. The Smalltalk-80 language specifies that its activation records can
be manipulated like any other objects in the system. Although this simplifies the
debugger, it creates more work for calls and returns and thus hurts system
performance. For example, when saving the program counter, a call must first convert it
from a pointer into a tagged integer offset. Deutsch and Schiffman have minimized the
overhead by providing multiple representations for activation records and automatic
conversion between them. In this manner, they defer expensive conversions as long as
possible. Since very few activation records are ever examined by the debugger, most
of these conversions are never performed at all, significantly reducing subroutine call
overhead.





• Deutsch-Bobrow deferred reference-counting. In addition to activation records, a
Smalltalk-80 system allocates a new object every 80 instructions on average [Ung84].
This heavy burden can make automatic storage reclamation a system bottleneck. In
this system, Deutsch-Bobrow deferred reference-counting [DeB76] reduces storage
reclamation overhead to 9% of the total CPU time.
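
The in-line caching technique in the list above can be sketched as follows. This is only an illustrative model, not the Deutsch-Schiffman code: the `CallSite` class, its fields, and the method tables are hypothetical. It also simplifies one detail, performing the type check at the call site rather than at the start of the target subroutine, and it stores the cached target in an object field rather than overwriting a machine instruction.

```python
class CallSite:
    """Hypothetical model of one call instruction with an in-line cache."""

    def __init__(self, selector, method_tables):
        self.selector = selector
        self.method_tables = method_tables  # {type name: {selector: routine}}
        self.cached_type = None             # type seen on the previous call
        self.cached_target = None           # routine found for that type
        self.lookups = 0                    # number of full table lookups

    def _full_lookup(self, type_name):
        """The costly path: search the tables for the target routine."""
        self.lookups += 1
        return self.method_tables[type_name][self.selector]

    def call(self, receiver_type, receiver):
        # Check that the cached target is still valid for this receiver's type;
        # on a mismatch, look the target up again and overwrite the cache.
        if receiver_type != self.cached_type:
            self.cached_target = self._full_lookup(receiver_type)
            self.cached_type = receiver_type
        return self.cached_target(receiver)


tables = {
    "Integer": {"double": lambda n: n * 2},
    "String": {"double": lambda s: s + s},
}
site = CallSite("double", tables)

# Nineteen calls with the same receiver type, then one with a different type.
results = [site.call("Integer", n) for n in range(19)]
results.append(site.call("String", "ab"))
```

In this run only two of the twenty calls reach the lookup routine; the others go directly to the cached target.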

2.4. Hardware for Exploratory Programming Environments

In addition to innovative software, special-purpose hardware may further reduce the
cost of an EPE. In the past, researchers have closely coupled the source language semantics
to the hardware-supported operations and data types. Although memory-efficient, this
approach has usually resulted in increased cost and poor performance. This section
examines five computers: the RICE computer, which introduced tags; the Burroughs 5700,
Scheme-79, and Symbolics 3600, machines designed for specific high level languages; and
the Katana-32, another microprocessor for the Smalltalk-80 system.

2.4.1. The RICE Computer 

The R-2 computer developed at Rice University was a tagged architecture with
subscript address calculation and bounds-checking hardware [Feu72]:

• A wide, 62-bit word size allowed an array’s length and initial index to accompany its 
base address. 

• A rich variety of numeric types, control words, and address words were encoded in the 
R-2’s four tag bits. (See Table 2.3.) 

The R-2 design simplified its compilers, provided a measure of protection for the operating
system, and reduced the amount of data needed by the debugger. Although it did not
maximize space, this design fostered sharing among many users in a common address space. To
our knowledge, the RICE computer was the first to add tags to data.







[Figure 2.1 field labels: Length, Present in Core, Initial Index, Indirect tags, Restricted access,
Direct tags, Software tags (trace bits), Write lockout]

Figure 2.1: R-2 address word format. The length and index of the first element accompany
the base address.


Table 2.3: R-2 tags.

Tag     Meaning

0000    mixed or untagged
0001    (unassigned)
0010    (unassigned)
0011    (unassigned)
0100    real, single precision
0101    54-bit binary string or integer
0110    double precision
0111    complex
1000    undefined for normal operations
1001    partition word
1010    relative control word
1011    absolute control word
1100    relative address, unchained
1101    absolute address, unchained
1110    relative address, chained
1111    absolute address, chained
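
As a rough illustration of how four tag bits can select among the word types listed in Table 2.3, the sketch below maps a tag value to its meaning. The function names are hypothetical, and the two predicates merely read a pattern off the table (the address words are the tags beginning with 11, and the chained ones have the next bit set); they are not a description of how the R-2 hardware decoded tags. The position of the tag within the R-2 word is not modeled.

```python
# Tag meanings copied from Table 2.3; keys are the four tag bits as an integer.
R2_TAGS = {
    0b0000: "mixed or untagged",
    0b0001: "(unassigned)",
    0b0010: "(unassigned)",
    0b0011: "(unassigned)",
    0b0100: "real, single precision",
    0b0101: "54-bit binary string or integer",
    0b0110: "double precision",
    0b0111: "complex",
    0b1000: "undefined for normal operations",
    0b1001: "partition word",
    0b1010: "relative control word",
    0b1011: "absolute control word",
    0b1100: "relative address, unchained",
    0b1101: "absolute address, unchained",
    0b1110: "relative address, chained",
    0b1111: "absolute address, chained",
}


def describe_tag(tag):
    """Return the Table 2.3 meaning of a four-bit tag value."""
    if not 0 <= tag <= 0b1111:
        raise ValueError("an R-2 tag is four bits wide")
    return R2_TAGS[tag]


def is_address_word(tag):
    """Pattern read off the table: address words have both high tag bits set."""
    return tag >> 2 == 0b11


def is_chained_address(tag):
    """Pattern read off the table: chained address words also set bit 1."""
    return is_address_word(tag) and bool(tag & 0b0010)
```

For example, `describe_tag(0b1101)` returns "absolute address, unchained", and `is_address_word(0b1011)` is false because that tag denotes a control word.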

2.4.2. The Burroughs B5700 and B6700 Computers

In the sixties and early seventies, the Burroughs Corporation introduced the first
commercial computers dedicated to a high-level language, their 5000 and 6000 series [Org73].
A tagged, stack-oriented architecture was chosen to host an Algol superset. Memory was at
a premium in those days, and its segmented virtual memory system enabled the B5700 to
operate with only 32,000 words of main memory. Paradoxically, adding 3 tag bits to each
45-bit memory word saved memory by reducing the number of words needed. For example,
tags on data reduced the size of instructions by permitting a single add opcode to serve all
types of numbers. Tags also helped with managing the stack and accessing data structures.
Table 2.4 illustrates the 6700's data formats. A substantial quantity of hardware in these
machines was devoted to supporting stack-based, block structured computation. The 5700
and 6700 proved that commercial computers could be designed for a high level language.


2.4.3. Scheme-79

Scheme-79, an early high-level language microprocessor, directly executed a dialect 
of Lisp [SHJ81]. 

• Each 32-bit word contained one bit to aid garbage collection, seven bits of type and 
opcode information, and a 24-bit pointer. (See Figure 2.2.) 

• An innovative and interesting design. Scheme’79 pushed Lisp abstractions to a low 
level to attain the power of interpreted execution at lower cost. For example, many 
opcodes were needed to maintain the correspondence with source-level Lisp 

I Table 2.4: Burroughs 6700 data formats! - ] 


Class of Operand 

Type of Word 

T«g 

1 numbers 

single-precision 

000 

double-precision (2 words) 

010 

descriptor words 

segment 

Oil 

data 

101 


control words 

indirect reference word 001 

stuffed indirect reference word 001 
mark stack control word 011 

return control word 011 

top-of-stack control word 011 

program control word 111 




GC type 


Figure 22: Scheme-79 data format. Two Of these words mike op a list node. 

primitives. (See Table 2.5.) As a result, microcode, microsubroutines, and nanocode 
woe used to fit the control circuitry on-chip. Schexne’79 had good performance com¬ 
pared to other in t e r pr e t e rs, but not when compared to compiled Lisp. This is shown in 
Table 2.6, from [Pon83a]. These data suggest that a machine that is specialized for a 
particular system must also exploit compilation to attain high performance. 


Instead of a linear sequence of instructions, Scheme-79 used a Lisp binary tree for pro¬ 
gram control, each node consisting of two words. T**e first word was the instruction 
and the second was a pointer to the next instruction. The instruction format is the same 


Table 2.5; Soma Scheme-79 opcodes. 
APPLY 
CAR 
CDR 

CLOSURE 

COND 

CONS 

EQ 

FIRST-ARC 

GLOBAL 

LIST 

LOCAL 

NIL 

PROCEDURE 

_SEQUENCE_ 


Table 2.6: Performance of the Scheme benchmark. 


VAX 11/780 Franz interpreter 
Scheme chip (projected) 

VAX 11/780 Franz, complied (normal funcall) 
VAX 11/780 Franz, compiled (local funcall) 


2 min 
1 min 
8.7 sec 

3 sec 


































as the data format sbowo above. This non-sequential format prohibits instruction pre¬ 
fetching and so reduces the speed of macro-instructions. 

• All data, including the stack contents, were kept in memory as lists. In addition to the
memory reference overhead, this approach wasted time to reclaim list space for temporary
values. Even with a microcoded link-reversal mark-and-sweep garbage collector
[ScW67, Sta80], Sussman estimated that Scheme would spend 80% of its time in
the storage allocator.

The Scheme-79 chip was fabricated in the MPC-79 Multi-University Multiproject
Chip-Set at λ = 2.5 µ (5 micron line width). It was 7500 µ long and 5900 µ wide. One of
the fabricated chips ran small programs and reclaimed storage. Fibonacci(20) took 100 million
cycles (@ 1600 ns) with a 64KW memory that was half-full. Over two-thirds of those
cycles were spent collecting garbage. Scheme-81 is a successor to Scheme-79 with more
aggressive silicon technology (λ = 1.5 µ, 12,000 µ w × 12,000 µ h) [BGH82]. Its designers
estimate Scheme-81 would run five times faster than Scheme-79. This would still run the
Scheme benchmark more slowly than compiled Franz Lisp on a VAX 11/780.


[Diagram fields: datatype, CDR code, immediate number]













2.4.4. The Symbolics 3600 Lisp Machine 


The Symbolics 3600 is a TTL personal minicomputer for Lisp [Roa83, Moo85]. It has
good performance, substantial complexity, and high cost: $80,000 for each programmer.

• Each word contains 36 bits: a two bit field for list compression (CDR-coding), a type 
field of two bits for numbers or six bits for pointers, and either a 32-bit data field or a 
28-bit pointer field. This provides a rich selection of hardware-supported types. Table 
2.7 lists some of the 34 types implemented by the 3600’s hardware and firmware. 

• Each 3600 instruction is 17 bits long, with nine bits of opcode and eight for the 
operand/address. There are seven instruction formats. Table 2.8 gives a sampling of 
the opcodes. 

• Some of the 3600’s instructions perform complex operations. Instructions such as 
multiply, divide, and store-array-leader may take many cycles to complete. These 
instructions must also handle many different data-types. These factors combine to 
require almost a million bits of control store, about twice that of a VAX-11/780. 

• Tags in the 3600 minimize the cost of dynamic typing. In conventional systems, a
datum’s type must be determined before it is used. A 3600 instruction assumes a


Table 2.7: Some Symbolics 3600 data types.

ARRAY
BIGNUM
CLOSURE
COMPILED CODE
COMPLEX NUMBER
COROUTINE
EXTENDED FLOATING POINT NUMBER
FLAVOR-INSTANCE
FLOAT
LEXICAL CLOSURE
LIST
NIL
RATIONAL NUMBER
SYMBOL

















Table 2.8: Some 3600 opcodes.

Category                         Examples
Data movement                    push-immed, pop-n-save, movem-local
Instance variable                push-instance-variable, movem-instance-variable, instance-ref
Function calling                 call-0-stack, call-n-return, funcall-1-stack
Binding and function entry       take-n-args, take-n-optional-args-rest
Function return                  return-stack, return-multiple
Quick function call and return   popj
Branch
Catch                            catch-open-stack, unwind-protect-open
Predicates                       eq, not, fixp, floatp, symbolp, arrayp
Arithmetic                       add-stack, subtract-stack, multiply-stack, quotient-stack,
                                 remainder-stack, rot-stack
List and symbol                  car, cdr, rplaca, set, symeval, property-cell-location,
                                 package-cell-location
Array                            array-leader, store-array-leader
Subprimitive                     halt, %multiply-double, %data-type, %pointer,
                                 %stack-group-switch, %gc-tag-read





















likely type and proceeds, while simultaneously verifying that assumption against the 
tag. If the assumption is false, the 3600 aborts the current microcode sequence and 
starts executing microcode for the required operation. This saves time for operations 
on the most common types. 

• An area-based automatic storage reclamation algorithm reclaims space by incrementally
copying surviving objects. The Symbolics machine has paged virtual memory
and its paging hardware aids storage reclamation by recording which pages of permanent
objects contain references to temporary objects. Area-based copying reclamation
is very efficient. (See the chapter on automatic storage reclamation.)

• The 3600’s microcycle time varies between 180 and 250 ns, making it one of the
fastest commercially available personal computers for an exploratory programming
environment [Pon83b].

Although providing good performance, the 3600’s $80,000 price tag reflects the cost of seeking
hardware solutions to system problems.

2.4.5. Katana-32

Midway through the SOAR project we learned of the Katana-32, also known as
Sword-32, an independent attempt by a group of researchers at Tokyo University to build a
fast VLSI Smalltalk-80 microcomputer [SKA84, Suz84]. Unlike our RISC approach, they
have continued with the traditional complex instruction set (CISC) style of computer architecture.
Table 2.9 compares the Katana and SOAR designs. Katana's large microstore, variable-length
bytecoded instructions, and 160 registers suggest that it is basically a Dorado on
a chip. Table 2.10 shows the benchmark used for their performance predictions, with Table
2.11 showing the resulting object code for both machines.

The designers of Katana-32 are relying on aggressive VLSI technology for their performance
projections. Their chip will have five times more transistors than SOAR, and have









Table 2.9: Comparison of SOAR and Katana-32.

[Most cells of this table did not survive extraction. Columns: SOAR, Katana-32.
Rows: architecture; number of instructions; instruction formats; instruction length;
data path width; microstore; registers; cycle time; number of transistors;
testActivationReturn micro-benchmark*.
Surviving cell values, in extraction order: 71 bytes : 21 bytes; bytecode interpreter;
-46; -9; 1-3 bytes; 32 bits; 4Kw x 45 bits.]


Table 2.10: The testActivationReturn benchmark.

Smalltalk-80                          C

recur: tl                             recur(tl) {
    tl = 0 ifTrue: [^self].               if (tl == 0) return;
    self recur: tl - 1.                   recur(tl - 1);
    ^self recur: tl - 1                   recur(tl - 1);
                                      }


* This one micro-benchmark is not a fair comparison. However, as far as we know, it is the only Katana performance
figure available.

♦ 12 J with a better compiler. 

‡ 510 ns is the measured cycle time of working NMOS SOAR chips, including 110 ns for the unexpected jump and call
delay [Pen85b, Pen85c]. (See Section 3.4.3.) 125 ns is the projected cycle time for Katana [Suz84].
Table 2.11: TestActivationReturn object code.

SOAR Machine Code                                              cycles
%loadc   (r_receiver)classOffset, r6                           2
%load    (r_returnAddress)0, r5                                2
%trap1   ne r5, r6               /* cache miss */              1
skip     eq r_tl, 0                                            1-2
jump†    .+2                                                   1†
remw     r_returnAddress, 1                                    2
sub      r_tl, 1, r6                                           1
%add†    r6, 0, r5               /* synthesized move */        1†
%add     r_self, 0, r6           /* synthesized move */        1
call     recur                                                 1
         <selector>
sub      r_tl, 1, r6                                           1
%add†    r6, 0, r5               /* synthesized move */        1†
%add     r_self, 0, r6           /* synthesized move */        1
call     recur                                                 1
%add     r6, 0, r_retVal                                       1
%trap2   geu r_retVal, CONTEXT_TAG                             1
remw     r_returnAddress, 1                                    2

length      72 bytes
min time    9 cycles
max time    19 cycles
average     14 cycles

Katana-32 Machine Code [SKA84, Suz84]      cycles
pushTemp: 0                                3
pushConstant: 0                            2
send: =                                    3
jumpFalse: 10                              3-6
returnSelf                                 4
pushSelf                                   2
pushTemp: 0                                3
pushConstant: 1                            2
send: -                                    4
send: recur:                               21
pop                                        1
pushSelf                                   2
pushTemp: 0                                3
pushConstant: 1                            2
send: -                                    4
send: recur:                               21
returnTop                                  4

length          21 bytes
min time        13 cycles
max time        83 cycles
average time    49 cycles


† These instructions could be eliminated by a better compiler.




















twice as many registers on the datapath, yet a cycle will only take one third the time. We
believe that SOAR could also run considerably faster if implemented in that technology.
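
To make the comparison concrete, the average cycle counts from Table 2.11 can be combined with the two cycle times quoted in the footnote to Table 2.9 (510 ns measured for SOAR, 125 ns projected for Katana-32). The sketch below is only this arithmetic; the function and variable names are illustrative and do not come from either project.

```python
# Cycle times quoted in the text: measured for SOAR, projected for Katana-32.
SOAR_CYCLE_NS = 510
KATANA_CYCLE_NS = 125

def elapsed_ns(cycles, cycle_time_ns):
    """Elapsed time for a given number of machine cycles."""
    return cycles * cycle_time_ns

# Average cycle counts for testActivationReturn from Table 2.11.
soar_avg_ns = elapsed_ns(14, SOAR_CYCLE_NS)
katana_avg_ns = elapsed_ns(49, KATANA_CYCLE_NS)

print("SOAR average:     ", soar_avg_ns, "ns")
print("Katana-32 average:", katana_avg_ns, "ns")
```

On these figures the two machines land within about 15% of each other on this one micro-benchmark, even though Katana's cycle is roughly four times shorter.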

2.5. Reduced Instruction Set Computer (RISC) Architecture

The machines described above are more elaborate and expensive than conventional
computers. We need a machine that has high performance at low cost. One recent style of
computer architecture, the reduced instruction set computer (RISC), claims to meet those
demands for traditional programming systems [PaD80, PaS81, PaS82]. In this style there is
a much closer coupling between architecture and implementation.

To design a RISC, 

• start with a fast and simple register-based instruction set similar to microcode in other 
machines, then 

• identify the time-consuming operations in typical programs, and finally

• take the hardware saved by simplifying instruction execution and dedicate it to speeding
up the time-consuming operations.

RISC designs contrast with traditional high-level language computers that rely on long
microcode sequences to provide complex functions “in hardware.” Instead of microcode,
RISC systems rely on software to provide complicated operations. Of course, software consumes
memory, but we would gladly add memory to gain speed. The rest of this section
touches on several important RISCs: IBM's 801, Berkeley's RISC I and II, and Stanford’s
MIPS. These reduced instruction set computers all point in the same direction: more performance
with less hardware.
















2.5.1. IBM-801 


The IBM-801 computer pioneered many RISC concepts [Rad82], including a simple
load/store instruction set and the coupling of architecture design with compiler technology.
A sophisticated graph-coloring algorithm enabled its compiler to optimize register allocation
over a fairly small register file [Cha82]. Constructed in ECL, the 801 attained excellent performance.
Although this work was not published immediately, it pioneered the benefits of a
reduced instruction set.

2.5.2. RISC I and II

The RISC I and II microprocessor chips were designed and built at Berkeley to yield
high performance for the C/Unix environment [KSP83]. Figures 2.4 and 2.5 are photographs
of the RISC I and II, respectively.

• True to their names, these reduced instruction set computers have about two dozen
instructions in their instruction sets, and are distinguished by the simplicity and compactness
of their control circuitry, which takes 5% to 10% of chip area. This contrasts with 50%
for more typical designs. The minimal and simple control circuitry shortens the design
time as well as instruction cycle time.

• These systems were designed for existing compiler technology. In this technology,
subroutine calls are slow because they save and restore registers. RISC I and II speed
up subroutine calls with hardware that eliminates this source of overhead. To accomplish
this, they spend the area saved by simplifying the control circuitry on a large
on-chip register file, organized as overlapping windows.

In addition to providing good performance, reduced instruction set computers are easier to
design. RISC I met the goal of functional correctness on first silicon, and RISC II ran at full
speed on first silicon, outperforming superminicomputers using the same compiler technology.
A more complex architecture would have jeopardized these goals.
Figure 2.5: Microphotograph of RISC II. Only 5% of the chip (the upper right corner)
is dedicated to control.

















































2.5.3. MIPS


MIPS stands for Microprocessor without Interlocked Pipelined Stages
[HJP83, HJB82]. It refines reduced instruction set architecture by eliminating pipeline interlock
hardware. Instead, the MIPS project has developed effective algorithms to schedule
instructions for the pipeline statically. The results are promising:

• Instruction dependencies are handled with a one-stage delayed branch. (The instruction
following a branch is always executed.) The MIPS reorganizer fills 70% of the
slots after delayed branch instructions. Since these branches account for 20% of all
instructions, and since MIPS has one delay slot per branch instruction, there are 20
delay slots for every 100 instructions. Filling 70% of them leaves only 6 wasted slots
per 100 instructions, which is only 6% slower than the (probably unrealizable)
optimum.

• Data dependencies are also handled by reordering instructions. The performance of
code generated this way is within 3% of the code that could be run with hardware pipeline
interlocks.

• Another finding of the MIPS project is that a word-addressed machine can run most 
programs faster than one with byte addressing. The problem with byte addressing is 
that the extra circuitry required can slow down word references. 

• MIPS demonstrates impressive performance: a simulated MIPS CPU with a 4 MHz
clock runs benchmarks about five times faster than an 8 MHz 68010.

The MIPS project blends simpler control circuitry with more sophisticated optimizing compiler
technology to achieve more performance with less hardware.
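
The delay-slot arithmetic quoted for MIPS above can be checked with a short calculation. This is a sketch of the arithmetic only; the function name and parameters are illustrative, and the 20%, one-slot, and 70% figures are the ones given in the text.

```python
def wasted_delay_slots(instructions, branch_fraction, slots_per_branch, fill_rate):
    """Delay slots left unfilled (wasted) for a given instruction count."""
    delay_slots = instructions * branch_fraction * slots_per_branch
    return delay_slots * (1.0 - fill_rate)

# 20% branches, one delay slot each, 70% of the slots filled by the reorganizer.
wasted = wasted_delay_slots(100, 0.20, 1, 0.70)
print(round(wasted, 6), "wasted slots per 100 instructions")
```

Six wasted slots per 100 useful instructions corresponds to the 6% slowdown relative to the optimum mentioned in the text.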
2.6. Summary

The Smalltalk-80 system provides a programming environment that boosts a
programmer's productivity. It does so by exploiting the object metaphor to shorten the
edit-compile-test-debug cycle. However, Smalltalk-80, along with other exploratory programming
environments, runs slowly on conventional hardware.

We have designed a reduced instruction set computer, and added features to it to support
Smalltalk. In doing so, we have followed in the footsteps of other architecture projects:

• The RICE computer pioneered tags, as a means to control data manipulations. 

• The Burroughs B5700 and B6700 computers supported Algol with tagged data,
descriptors, and a tailored instruction set.

• Scheme-79 was the first attempt to marry Mead-Conway VLSI design with an interpretive
language.

• The Symbolics 3600 Lisp Machine is a commercially successful computer dedicated to
a specific exploratory programming environment.

• IBM-801 revived interest in simple computers and highly optimizing compilers for 
non-floating point applications. 

• RISC I and II at Berkeley taught us much about instruction sets, register windows, and
data path design.

• The MIPS machine at Stanford encouraged us to forego byte addressing. 

SOAR combines a simple RISC architecture with enough tagging to support the common
cases. In the following chapters, we describe SOAR’s architecture, assess the worth of
each architectural feature, explain important algorithms in its system software, and propose
designs for future systems.
Chapter 3


The SOAR Architecture 

3.1. Introduction 

This chapter describes the SOAR architecture, contrasting SOAR with its predecessor,
RISC II. Most innovations in SOAR compensate for sources of overhead in Smalltalk-80
systems: run-time type checking, virtual machine interpretation, elaborate and frequent procedure
calls, and maintaining many small, dynamic data structures. We conclude with an
overview of the implementation, detailed in Pendleton’s doctoral dissertation [Pen85b]. A
summary of this chapter has been previously published [UBF84]. A more detailed architectural
description appears in [SKF85].

Two figures-of-merit accompany each feature: execution time and memory space. We
gauge a feature’s significance by examining what would happen if we left it out. Thus an
omission time cost of 50% means that a job requiring 100 cycles on full SOAR would take
100 + 50, or 150 cycles without the feature. Likewise an omission space cost of 33% indicates
that the whole Smalltalk-80 system would grow by 33%, from 1.5 MB to 2.0 MB.
With these metrics, we can find the combined impact of removing two independent features
simply by adding the omission costs for each. These data are the results of simulations and
assume no radical compiler changes. (The derivation of the numbers is explained in the next
chapter and in Appendix A.)
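
The omission-cost metric defined above can be written out as a small calculation. This is an illustrative sketch: the function names are invented here, and costs are expressed as whole percentages to mirror the text.

```python
def cost_without_feature(baseline, omission_cost_pct):
    """Cycles (or megabytes) needed once a feature is left out."""
    return baseline * (100 + omission_cost_pct) / 100

def combined_omission_cost(*omission_costs_pct):
    """Per the text, costs of independent features are simply added."""
    return sum(omission_costs_pct)

# The two worked figures from the paragraph above.
cycles = cost_without_feature(100, 50)    # 100 cycles on full SOAR
space_mb = cost_without_feature(1.5, 33)  # 1.5 MB image

print(cycles, "cycles;", round(space_mb, 2), "MB")
```

For example, two independent features with time costs of 26% and 16% would, under this additive rule, have a combined omission cost of `combined_omission_cost(26, 16)`, or 42%.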

3.2. Type Checking

The FORTRAN statement “I = J + K” denotes integer addition, and can be performed
with a single add instruction. But since Smalltalk-80 has no type declarations, J and K may
hold values of any type, from booleans to B-trees. Thus, every time a Smalltalk-80 system
evaluates “J + K”, it must first check the types and then perform the appropriate operation.
Measurements of conventional Smalltalk-80 systems show that over 90% of the “+” operations
do the simplest possible operation, integer addition [Bla83c]. Since a type check takes
at least as long as an add instruction, most Smalltalk-80 systems waste a lot of time checking
types for integer arithmetic.

3.2.1. Tags Trap Bad Guesses 

The purpose of data tags in SOAR is to improve performance, not to discover program
errors as in the R-2 and B6700. SOAR's instruction set follows other Smalltalk-80 implementations
in having only two types of tagged data: integers and pointers [GoR83]. In
SOAR, the high-order bit of each word distinguishes these two types. For arithmetic and
comparison operations, SOAR assumes that the operands are integers and begins the operation
immediately, simultaneously checking the tags to confirm the guess. Most often
(>92%, Table A.4) both operands are integers and the correct result is available after one
cycle. If not, SOAR aborts the operation and traps to routines that carry out the appropriate
computation for the data types. Figure 3.1 shows the SOAR tags. This feature is very
important; without it, SOAR would run 26% slower and require 15% more memory (Tables
A.7 and A.8). SOAR is the only Smalltalk-80 system that overlaps these operations. Every
other Smalltalk-80 system incurs a time penalty for serial tag checking. It would be very
difficult for an optimizing compiler to eliminate these checks in the absence of type declarations.
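
The guess-and-verify behavior described above can be modeled in software. This is a minimal sketch, not SOAR's actual logic: it assumes a high-order tag bit of 0 marks an integer and 1 marks a pointer (consistent with the values in Table 3.1), it ignores arithmetic overflow, and it necessarily runs the tag check after the add, whereas the hardware does both in the same cycle. All names are hypothetical.

```python
TAG_BIT = 1 << 31          # assumed encoding: 0 = integer, 1 = pointer
INT_MASK = (1 << 31) - 1   # the 31 data bits of a tagged integer

class TagTrap(Exception):
    """Stands in for the hardware trap to a software handler."""

def encode_int(value):
    """Encode a Python int as a 31-bit two's complement tagged integer."""
    if not -(1 << 30) <= value < (1 << 30):
        raise ValueError("value does not fit in 31 bits")
    return value & INT_MASK

def decode_int(word):
    """Recover the signed value from a tagged integer word."""
    value = word & INT_MASK
    return value - (1 << 31) if value & (1 << 30) else value

def is_integer(word):
    return (word & TAG_BIT) == 0

def tagged_add(a, b):
    # The ALU result is produced unconditionally, as if both were integers.
    speculative = (a + b) & INT_MASK
    # The tag check confirms the guess; a bad guess discards the result.
    if not (is_integer(a) and is_integer(b)):
        raise TagTrap("operand is not an integer")
    return speculative
```

In the common case `tagged_add(encode_int(2), encode_int(3))` yields the encoding of 5 directly; passing a pointer-tagged word such as `0xB0000000` raises `TagTrap`, which corresponds to falling back to a software routine.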

3.2.2. Conditional Skip Instructions

Although condition codes have been widely used to decouple a test from a branch, they 
are awkward for a Smalltalk system. Instead of condition codes, SOAR has 
compare-and-skip instructions that quickly perform integer comparisons. Remember that 
Smalltalk has dynamic type binding. Thus, in SOAR, “i < j” must be computed with an 















[Diagram: format of integer data (31-bit 2's complement integer); format of pointer data (28-bit word address).]

Figure 3.1: SOAR tagged data types. SOAR supports two data types, 31-bit signed integers
and 28-bit pointers. Pointers include a generation tag (as explained in Section 3.5.1).
SOAR words could have contained 32 bits of data plus one bit of tag for a total of 33 bits.
The scarcity of 33-bit tape drives, disk drives, and memory boards led us to shorten our
words to a total of 32 bits including the tag (31 bits of data).

instruction that checks the tags of i and j as it compares them. If the condition holds, there is
a one cycle penalty for skipping an instruction. If the condition fails, the instruction following
the skip is executed. This is usually a jump. What if one of the operands is not an
integer? A trap to the appropriate comparison software will be taken. In a condition code
architecture, this software (e.g. the floating point compare routine) would have to set the
condition codes to reflect the result. In SOAR, all it must do is return to the next instruction
or the one after that, a simpler and faster operation.

Separating a conditional jump into a conditional skip and unconditional jump does not
impose a significant performance penalty. SOAR jump instructions contain the absolute
address of the target instruction. Because no address computation is required, SOAR eliminates
the instruction prefetch penalty for jumps (see Fast Shuffle in Section 3.4). Thus, a
conditional branch can be simulated in two cycles, one for the skip and one for the jump.
The only way to speed up conditional branches would be to add a one cycle
compare-and-branch instruction to SOAR. Such an instruction would require the addition
of a separate adder to compute the branch target address in parallel with the comparison
operation. Worse, it would only speed up SOAR by 3%, which would not justify the additional
hardware. (See Section A.2.2.)
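
The two-cycle cost of a skip-plus-jump pair follows from the rules stated above, and can be tabulated with a short sketch. The function name is illustrative, and the model covers only the integer case; a non-integer operand would trap instead.

```python
def conditional_branch_cycles(condition_holds):
    """Cycles for a compare-and-skip followed by an unconditional jump."""
    skip_cycles = 1
    if condition_holds:
        # The skip suppresses the following jump at a one-cycle penalty.
        return skip_cycles + 1
    # The skip falls through and the one-cycle jump executes.
    jump_cycles = 1
    return skip_cycles + jump_cycles

for holds in (True, False):
    print("condition holds:", holds, "->", conditional_branch_cycles(holds), "cycles")
```

Either way the pair costs two cycles, which is why a dedicated one-cycle compare-and-branch could save at most one cycle per conditional branch.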
3.2.3. Two-Tone Instructions

A tagged architecture that lacks microcode must include instructions that manipulate
and inspect tags. Because the Smalltalk system already relies on the compiler to ensure system
integrity, we can allow the compiler to mix instructions that manipulate tags with
instructions that are constrained by tags. Each SOAR instruction contains a bit that either
enables or disables tag checking. Untagged mode (indicated by a % in the assembly
language) turns off all tag checking and operates on raw 32-bit data. In untagged mode the
tag bits are treated as data, and the complete instruction set can be used to manipulate this
data. Untagged instructions also allow programs written in conventional languages such as
C and Pascal to run on SOAR. Instead of providing two versions of each instruction, we
could have defined a mode bit in the PSW. This would have been very expensive, increasing
execution time by 16% and memory usage by 19% (Tables A.11 and A.12).

3.2.4. Tagged Immediate Operands

SOAR’s immediate format has been designed to accommodate tagged data. The
high-order four bits of the 12-bit field become the tag bits of the operand, the low order
seven bits of the immediate field form the low order seven bits of the operand, and the eighth
bit is sign-extended to fill in the bits in the middle (see Figure 3.2). Thus, any tagged value
between -128 and 127 can be represented as shown in Table 3.1. This saves time by allowing
the Smalltalk-80 software to encode some important tagged values as immediate
operands. Of course, there is no such thing as a free lunch. Reserving four tag bits severely
curtails the range of addresses and offsets, from -2048..2047 to -128..127. However, this
representation optimizes the more frequent case and improves performance by 10% (Table
A.15).
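
The expansion rule just described can be expressed as a few bit operations. The sketch below follows the wording of the text (four tag bits to the top of the word, seven low-order bits copied, the eighth bit sign-extended through the middle); the function name is hypothetical, and the sample values are the ones that survive in Table 3.1.

```python
def expand_immediate(field):
    """Expand a 12-bit SOAR immediate field into a 32-bit operand."""
    if not 0 <= field < (1 << 12):
        raise ValueError("immediate field must fit in 12 bits")
    tag = (field >> 8) & 0xF        # high-order four bits become the tag
    low7 = field & 0x7F             # low-order seven bits are copied
    sign = (field >> 7) & 1         # eighth bit fills the middle bits
    middle = 0x0FFFFF80 if sign else 0
    return (tag << 28) | middle | low7

for field in (0xFFF, 0x07F, 0x7FF, 0xB00, 0xB7F):
    print(f"{field:03X} -> {expand_immediate(field):08X}")
```

With a tag of 0xB, the 128 fields from B00 to B7F all expand to pointers in one small region, which is how frequently referenced objects such as nil, true, and false can be named by an immediate.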


















Table 3.1: Useful immediate values.

Immediate Field      Expansion
from     to          from        to

32-bit Integers
         FFF         FFFFFF80    FFFFFFFF
         07F         00000000    0000007F

31-bit Integers
         7FF                     7FFFFFFF
                                 0000007F

Pointers to Frequently Referenced Objects (includes nil, true, and false)
B00      B7F         B0000000    B000007F

Values for Testing Tags of Pointers
         assistant generation
         associate generation
         full generation
         emeritus generation
         activation record

Decimal values: -128, -1, 0, 127























































Smalltalk can be transported to a new machine by writing only the virtual machine emulator.

This approach has drawbacks too:

• Decoding such dense instructions takes either substantial hardware or substantial time.
For example, the Dorado Instruction Fetch Unit consumes 20% of the CPU [Pie83], and in
Berkeley Smalltalk, decoding a simple bytecode takes twice as long as executing it.

• Some of the high-level instructions require many microcycles to execute. These multicycle
instructions must be sequenced by a dedicated control unit.

3.3.1. Reduced Instruction Set 

Following the reduced instruction set approach, we abandoned the Smalltalk virtual
machine instruction set and designed the SOAR instruction set from scratch to minimize the
time and hardware needed to decode and execute instructions. SOAR instructions therefore
resemble microinstructions. Although such an instruction set results in larger object code,
we believe that the cost of 500 KB of additional main memory is offset by an approximate
doubling in speed.

Each SOAR instruction occupies a 32-bit word, and most instructions take one cycle.
The only exceptions are loads, stores, and returns, which take two cycles. The uniform
length and duration of instructions simplify instruction prefetch. Figure 3.3 shows instruction
formats.

SOAR departs from RISC II by omitting byte-addressing. Instead, separate instructions
insert or extract bytes from words. Unlike systems for other languages such as C,
Smalltalk-80 systems do not support scalar data types that occupy a single byte. (The system
software uses bytes to pack fields into the object header.) Processors with
byte-addressing incur a time penalty due to the alignment logic. Even if no penalty
occurred, adding byte addressing would only improve performance by 7% (Table A.17). On









































































































































If the object is a tagged integer, its type must be supplied by a trap handler. Dedicating an
opcode to this function saves time in the trap handler. Likewise the sll instruction allows a
tag trap to be treated differently according to whether addition or shifting was intended.
Neither of these cloned instructions is very important. The loadc instruction realizes only a
0.5% performance improvement (Table A.18). We believe that the sll instruction would not
improve performance much either. Since the compiler used for these studies did not go to
the trouble to generate it, we could not measure the frequency of this instruction.

Two glaring omissions from SOAR are a barrel shifter for single-cycle, multiple-bit
shifts and support for integer multiplication and division. Although multiple-bit shifts may
be important for driving the bitmapped display, they would speed up normal Smalltalk-80
programs by less than 0.4% (Table A.19). Likewise, instantaneous multiplication and division
would shave only 3% off of our benchmark times (Table A.20).

One drawback of SOAR's reduced instruction set is the increased time for compilation.
Bush has written a converter in Smalltalk that translates bytecodes to SOAR instructions
[Bus85]. He reports that, running on a Dorado, the mean time to convert a subroutine is 50
ms, and that “Subjectively, the converter does not intrude on interactive system use.”
The extra time needed to compile to SOAR instructions does not seem to pose a problem.

More significantly, SOAR’s simple instruction set enlarges compiled code. Experience
with Hilfinger’s Slapdash SOAR compiler suggests that on the average, one bytecode
results in one 32-bit SOAR instruction. Thus, ignoring data objects, object headers, and
literal data within subroutines, there is a fourfold code expansion. However, bytecodes constitute
only about one eighth of a 32-bit Smalltalk-80 image, and the net increase is only 0.5
MB over the original 1 MB. This is not an exorbitant price to pay given current memory
technology.






























Other compiled Smalltalk-80 systems also pay this price. The Xerox 68010 system
devotes 0.25 MB to a cache of compiled code [DeS84]. Deutsch reports that one bytecode
results in six bytes of MC68010 instructions, which is worse than the factor of 4 for SOAR
[Deu85]. This means that if it were to compile all of the code, as the SOAR system does,
the Xerox 68010 system would need 0.7 MB (Table 3.3).
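
The space figures above follow from simple arithmetic, sketched below. The sketch assumes, as the text's "net increase over the original 1 MB" suggests, that compiled code is added on top of a 1 MB image in which bytecodes occupy about one eighth; the function name and this accounting are illustrative, not taken from the measurements.

```python
def image_size_mb(base_mb, bytecode_fraction, expansion_ratio):
    """Image size once every bytecode has been expanded into native code."""
    bytecode_mb = base_mb * bytecode_fraction
    return base_mb + bytecode_mb * expansion_ratio

soar_mb = image_size_mb(1.0, 1 / 8, 4)        # one 4-byte instruction per bytecode
m68010_mb = image_size_mb(1.0, 1 / 8, 6)      # six bytes of MC68010 code per bytecode

print("SOAR:", soar_mb, "MB; hypothetical 68010:", m68010_mb, "MB")
```

Under these assumptions SOAR needs about 0.5 MB of compiled code and a fully compiling 68010 system about 0.75 MB, in line with the roughly 0.7 MB quoted above.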

Finally, our decision to abandon bytecodes will force us to rewrite the Smalltalk-80
debugger. Lee has designed a debugger for SOAR and has built a prototype in Berkeley
Smalltalk [Lee84]. He exploited the hardware organization of SOAR in the design of the
debugger to add a conditional breakpoint facility and increase execution speed during
debugging.


3.3.2. SOAR Interrupts and Traps

Interrupts and traps play a larger role in SOAR than in RISC II. Unlike C, Smalltalk
grew in an environment with extensive, system-specific microcode. Since SOAR has no
microcode, unusual situations must be met with a trap to a software handler. For example,
as described above, other Smalltalk implementations check the types of arithmetic operands
sequentially, before performing the operation. SOAR checks in parallel, trapping if the
operands are not simple integers. These account for about half of the traps (Table A.25).

How valuable are conditional trap instructions? They save time and space by replacing
a two-cycle two-instruction sequence with one single-cycle instruction. For instance, the
prologue in each subroutine uses a conditional trap instruction that verifies the type of its


Table 3.3: Space Penalty of Compilation.

system                execution model           code expansion ratio    memory required*
Berkeley Smalltalk    bytecode interpreter      1                       1.0 MB
Xerox 68010           cache of compiled code    6                       1.3 MB
SOAR                  compiles everything       4                       1.5 MB
hypothetical 68010    compiles everything       6                       1.7 MB






























first argument. This saves a cycle over a skip and branch in the common case. Trap instructions
also support type checking in low-level primitive routines, and tag checking for
automatic storage reclamation. However, if the trap instruction traps, it takes more time to
handle the trap than the jump from a skip-and-jump sequence. In fact, trap instructions
account for 10% of the traps (Table A.25). Despite all these uses, the savings from trap
instructions do not add up to much; SOAR would run only 4% slower and require only 2%
more memory without them (Tables A.23 and A.24). The fact that trap instructions save little
time results more from the low frequency of trap instructions than from the penalty associated
with taking the traps.

The remaining source of traps also arises in RISC II. A call or return that exceeds the
on-chip register window capacity must trap to a routine to save or restore a set of registers.
This accounts for the remaining 40% of the traps (Table A.25).

To reduce the cost of trapping, SOAR exploits shadow registers that catch the
operands of the trapping instruction. These are inexpensive in single-chip processors; they
are just two more registers on the data busses near the ALU. This feature is insignificant;
without it, SOAR would run only 0.04% slower and require no more memory (Table A.26).
Other features that simplify trap handling include simple instructions and uniform instruction
size.

SOAR does not support nested interrupts or traps because they complicate the architecture.
The interrupt-enable bit in the PSW (Figure 3.4) is reset upon an interrupt or trap.
Each trap handler first captures any necessary machine state, then re-enables interrupts.
Most handlers need their own register window to hold this state. The normal method to
obtain a new register window would be to execute a call instruction but, since a call can
cause a trap (see above), the trap handler must simulate the call (and trap). After getting a
new window and saving the machine state, the handler can re-enable interrupts (and optionally
surrender its register window) with a form of the return instruction.





















Figure 3.4: SOAR Program Status Word. The SOAR program status word contains a destination
register shadow field, an opcode shadow field, and enable bits for external and
software interrupts.

When an interrupt or trap occurs, the instruction that is executing is aborted before it
can change any registers. The address of the aborted instruction is saved in r7. I/O interrupts
are disabled by clearing the interrupt enable bit in the PSW. This freezes the shadow
registers, which normally track the ALU inputs. A vector is constructed from the trap base
register, the opcode of the aborted instruction, and the reason for the trap. Finally, control is
transferred to the vectored location. Table 3.4 lists the various categories of traps, with
interrupt priority listed from highest to lowest.

Many instructions can trap for several reasons at once. To simplify the interface to the
trap handler code, the reasons are prioritized. After handling a trap, the offending instruction
is typically reexecuted to spring any remaining traps. Table 3.5 shows which reasons
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Table 3.4: SOAR traps and interrupts. 


Vector Pri Class Explanation 


I<31»1 or I<28:23> - urused 


Tag Trap (TT) 

Software Interrupt (SWI) 

1 

2 

B 

B 

! See (SKF85). 

! k30£9> > 01 and psw<5> - 1 

Window Overflow (WO) 

3 

C 

, 1 • call and cwp<6:4> - 1 ■swp<6:4> 

Window Underflow (WU) 

4 

C 

1 1 • ret and cwp<6:4> ♦ 1 - swp<6:4> 

Data Page Fault (DPF) 

5 

c 

page fault pin awaited dunag 
data memory access 

Trap Instruction (TI) 

6 

c 

1 - trap instructs oc A condition is sue 

Generation Scavenging (GS) 

7 

D 

See [SKF85]. 

Instruction Page Fault (1PF) 

8 

E 

page fault pin aaaened during 
(•fetch of oext instruction 

Input/Output (lO) 
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VO interrupt pm asserted during 

1-fetch of next instruction 
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apply to which instructions. If instead of vectoring. SOAR put the reason for the trap in a 
special register the system would be only 3% slower (Table A.28). 

When SOAR does trap, it expends two extra cycles to flush the pipeline. A one-cycle
trap, while feasible, would have significantly degraded the cycle time [Pen85b]. Since the
extra trap cycle increased the number of cycles by less than one percent, the net result was a
faster system.

3.4. Fast Calls 
The Smalltalk-80 system stresses program modularity, but omits macros because they
would make it harder to incorporate changes quickly. If the user changed a macro, the system
would have to recompile all of the modules that instantiated it. This would make it
more difficult to maintain the split-second response time that is crucial to highly productive
programming. Instead, Smalltalk-80 programs are broken up into many small subroutines.
Consequently, Smalltalk-80 systems execute a higher percentage of call instructions than
most other systems. In addition to being frequent, calls are also expensive because:

• To aid program debugging, Smalltalk-80 initializes all local variables on each call. 
• A consequence of Smalltalk-80's power is that the destination of a call is recomputed
from the type of the first argument with a table lookup each time the call is executed.


Table 3.5: Trap reasons by instruction category.

            Class
Category    A     B     C     D     E
Call        ILL   SWI   WO          IPF
Jump        ILL   SWI               IPF
Return      ILL         WU    GS    IPF
ALU         ILL   TT                IPF
Skip        ILL   TT                IPF
Trap        ILL   TT    TI          IPF
Shift       ILL   TT                IPF
Load        ILL   TT    DPF         IPF
Store       ILL   TT    DPF   GS
Byte
The result is that many Smalltalk implementations (including Berkeley Smalltalk and
Dorado Smalltalk) spend about half of their time on calls and returns [Deu81]. SOAR
reduces the Smalltalk call/return overhead in several ways.

3.4.1. Multiple Overlapping On-Chip Register Windows 

SOAR, like RISC I, optimizes subroutine calls and returns by providing a large,
on-chip register file. The registers are divided up into overlapping windows. Instead of saving
or restoring registers, calls or returns merely switch windows (Figure 3.5). Compared to
C language subroutines, the shorter Smalltalk subroutines pass fewer operands and use fewer
local variables, and so need fewer registers. For this reason, each SOAR register window
has eight registers instead of 12 for RISC I. Figures 3.6 and 3.7 show the register organization
of SOAR. In addition to 56 more registers, the inclusion of register windows results in
the addition of a register to select the current window (the Current Window Pointer, or cwp),
a register to detect overflows by recording the last saved window (Saved Window Pointer, or
swp), more elaborate register decoders, and trapping logic [Pen85b]. Despite the cost of all
the added hardware, Smalltalk-80’s predilection for procedure calls makes this feature very
important. The cost of saving and restoring a conventional register file would slow the
machine down by 46%, even with load- and store-multiple instructions (Table A.29).
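
The window-switching idea can be illustrated with a small model. This is only a sketch under stated assumptions: the class and method names are hypothetical, the direction in which the window pointer moves is chosen arbitrarily, and the layout (a caller's outgoing "low" bank is the callee's incoming "high" bank) follows the general description of overlapping windows rather than SOAR's actual decoders.

```python
class WindowTrap(Exception):
    """Stands in for the overflow/underflow trap to software."""


class WindowedRegisterFile:
    WINDOW = 8  # registers per window, as stated in the text

    def __init__(self, windows):
        # One extra bank so the deepest window still has LOW registers.
        self.physical = [0] * (self.WINDOW * (windows + 1))
        self.windows = windows
        self.cwp = 0  # current window pointer (direction of growth is assumed)

    def _index(self, bank, n):
        base = self.cwp * self.WINDOW
        if bank == "low":          # outgoing arguments live in the next bank
            base += self.WINDOW
        return base + n

    def read(self, bank, n):
        return self.physical[self._index(bank, n)]

    def write(self, bank, n, value):
        self.physical[self._index(bank, n)] = value

    def call(self):
        if self.cwp + 1 >= self.windows:
            raise WindowTrap("overflow: software must save a window to memory")
        self.cwp += 1              # switch windows; nothing is copied

    def ret(self):
        if self.cwp == 0:
            raise WindowTrap("underflow: software must restore a window")
        self.cwp -= 1
```

In this model a value written to a caller's low register 0 is visible as the callee's high register 0 after `call()`, with no data movement; only running out of windows costs anything.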



Physical Registers Logical Registers 



Figure 3.6: SOAR's register windows. Like RISC I, SOAR has many physical sets of registers
that map to the logical registers seen by each subroutine.



Figure 3.7: Logical view of register file. The HIGHs hold incoming parameters and local 
variables. The LOWs are for outgoing arguments. The SPECIALS include the PSW and a 
register that always contains zero. The GLOBALs are for system software such as trap 
handlers. 







When the number of activations on the stack exceeds the on-chip register capacity,
SOAR traps to a software routine that saves the contents of a set of registers in memory.
Unlike RISC II, SOAR has load- and store-multiple instructions to speed register saving and
restoring. These instructions can transfer eight registers in nine cycles (one instruction fetch
and eight data accesses). Without them, the system would need eight individual instructions
that would consume sixteen cycles (eight instruction fetches plus eight data accesses).
Load- and store-multiple are also helpful for garbage collection, copying data, and operations
on bit-mapped images. These instructions have the ability to operate on
non-contiguous data; the increment between memory references is given by the SOURCE2
field. In retrospect, these multi-cycle instructions added some complexity to the design, and
the benefits (3% of execution time and 2% of memory) may not be worth the costs
(Tables A.33 and A.34).
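
The cycle counts quoted above follow from a simple cost model, sketched here. The function names are hypothetical; the only assumptions are the ones the text states, namely one cycle per instruction fetch and one cycle per data access.

```python
def cycles_with_multiple(registers):
    # One instruction fetch, then one data access per register transferred.
    return 1 + registers


def cycles_with_single_transfers(registers):
    # Each individual load or store costs one fetch plus one data access.
    return 2 * registers


saved = cycles_with_single_transfers(8) - cycles_with_multiple(8)
```

For an eight-register window this gives nine cycles against sixteen, a saving of seven cycles per window saved or restored.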

3.4.2. Caching Call Targets In Line

Another way SOAR reduces subroutine overhead is by decreasing the time taken to
find the target of a call. Once computed, the target's address is cached in the instruction
stream for subsequent use, as suggested by Schiffman and Deutsch [DeS84]. Figures 3.8
and 3.9 illustrate this idea. This in-line caching exacts a price for its time savings; SOAR
must support non-reentrant code. Since all Smalltalk processes share the same address space,
process switches must be avoided in sections of code that modify or use the cached data.
One approach would be to implement semaphores in software. This would be too expensive
because each Smalltalk call executes a short non-reentrant section of code. The approach we
followed was to add a bit to each instruction to disable process switches.

In Smalltalk, calls and jumps are so frequent that the virtual machine can defer a process
switch until executing the next call or jump instruction. The SOAR call and jump
instructions include a bit to specify when it is safe to switch processes [Deu82b]. This bit
enables a software interrupt. When the operating system desires a process switch, it sets a
bit in the Program Status Word requesting the software interrupt and resumes execution of
the same process. The next time a safe jump or call is executed, the software interrupt
transfers control to the operating system, which can then safely suspend the process.
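
The deferred process switch can be sketched as follows. This is an illustrative model only: the `Processor` class, its method names, and the string opcodes are hypothetical, and the single flag stands in for the request bit in the Program Status Word.

```python
class Processor:
    def __init__(self):
        self.switch_requested = False  # models the PSW software-interrupt bit
        self.log = []

    def request_process_switch(self):
        # The operating system only sets the bit; the current process keeps running.
        self.switch_requested = True

    def execute(self, opcode, safe=False):
        # Only calls and jumps carry the "safe to switch" bit.
        if opcode in ("call", "jump") and safe and self.switch_requested:
            self.switch_requested = False
            self.log.append("software interrupt before " + opcode)
            return "trapped"
        self.log.append(opcode)
        return "executed"
```

A request made in the middle of a non-reentrant sequence is therefore held until the first call or jump that is marked safe; unmarked calls and all other instructions proceed normally.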

Although complicated, in-line caching pays handsome rewards. The conventional way
to cache call targets is a hash table. But the overhead for probing into a hash table would
slow SOAR by 33% (Table A.37). The hardware penalty for in-line caching is the software
trap mechanism. If we were forced to omit this, we could use an indirect in-line cache. The
information could be cached in a per-process data area instead of the call instruction. This
would slow SOAR down by 7% (Table A.37). Even with in-line caching, SOAR still spends
11% of its time in cache probes and another 12% handling misses. Further research into
computing the target of the call could yield substantial savings.

Figure 3.8: Caching the target address in the instruction stream. In this example, the print
routine is called with an argument that is a string. (The argument is passed in iti.) The first
time the call instruction is executed, the call contains the address of a lookup routine and the
word after the call contains a pointer to the name “print.” The lookup routine follows the
pointers to the entry table for strings, and finds the entry for “print.” It then overwrites the
call instruction with a call to that routine and replaces the word after the call with the type
of the argument (string).

Figure 3.9: Caching the target address in the instruction stream. The next time the call is
executed, control goes directly to the string print routine. A prologue checks that the
current argument's type matches the contents of the word following the call instruction.
This word contains the type that the argument had the previous time the call was executed.
If the types match, control falls through to the string print routine; otherwise another table
lookup is needed.
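
The lookup-and-overwrite scheme can be sketched in a few lines. This is a simplified model, not SOAR's code: `CallSite`, `send`, and `method_tables` are hypothetical names, Python types stand in for Smalltalk classes, and the type check that SOAR performs in the callee's prologue is folded into `send` for brevity.

```python
class CallSite:
    """A call instruction plus the word that follows it."""

    def __init__(self, selector):
        self.selector = selector
        self.cached_type = None    # word after the call: type seen last time
        self.cached_target = None  # target address held in the call itself
        self.misses = 0


def send(site, receiver, method_tables):
    receiver_type = type(receiver)
    # Prologue check: does the argument's type match the cached type?
    if site.cached_type is not receiver_type:
        site.misses += 1
        # Miss: do the full table lookup, then overwrite the call site.
        site.cached_target = method_tables[receiver_type][site.selector]
        site.cached_type = receiver_type
    return site.cached_target(receiver)


tables = {
    str: {"print": lambda s: "string:" + s},
    int: {"print": lambda n: "int:" + str(n)},
}
site = CallSite("print")
results = [send(site, "a", tables), send(site, "b", tables), send(site, 3, tables)]
```

The second call with a string hits the cache; the call with an integer misses and rewrites the site, so `site.misses` ends at 2.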


3.4.3. Fast Shuffle: One Cycle Calls and Jumps

Finally, the call instruction itself has been designed for rapid execution. In most architectures,
a call requires an address computation (typically the addition of an offset to a base
register). This forces the call to take an extra cycle because its target cannot be prefetched.
In SOAR, the call instruction contains the absolute address of its destination. Furthermore, a
call (or jump) can be recognized easily by examining only one bit. This makes it possible to
detect these instructions in time to send the incoming data back to the memory as an address.
This way, a call or jump on SOAR executes at full speed, requiring only one cycle. This
“Fast Shuffle” mechanism combines on-chip logic to detect calls and jumps with an
off-chip latch to store the incoming instruction and send it back to memory. Figure 3.10
illustrates the Fast Shuffle logic. Though not spectacular, its performance impact is
significant: SOAR would use 11% more cycles without the Fast Shuffle.


Pendleton has uncovered a serious flaw in our realization of the Fast Shuffle [Pen85a].
When a jump or call instruction follows a skip, the skip condition must be evaluated before
the chip can signal a Fast Shuffle to the memory system. If the condition holds, the memory
system must use the PC as the address of the next instruction; if the condition fails, the
memory system must use the target field from the jump or call instruction. In designing the
instruction set, we encoded the condition field (of skip and trap) so tightly that a PLA was
required to decode the condition and the output of the ALU. This PLA adds 110 ns to the
time needed to compute the Fast Shuffle control signal during a skip instruction. Although
the NMOS SOAR chips can execute an instruction in 400 ns, the memory system cannot
start the next instruction fetch for another 110 ns, reducing the effective cycle time to about
510 ns. This overhead could be eliminated by forgoing the Fast Shuffle and using delayed
branches and calls. Alternatively, the instruction set could be redesigned with a condition
field that could be decoded more quickly. This problem would have been found much earlier
if we had simulated the whole system instead of the processor.

3.4.4. The Return Instruction: Parallel Register Initialization 

The other half of the team is the return instruction. In SOAR, the return instruction
performs one compulsory and three optional functions, specified by the low-order three
opcode bits. The compulsory function is a transfer of control, which means that the
bare-bones return instruction can be used as an indirect jump. If tag checking is enabled, the
tag of the return address is checked. This provides a means to intercept returns when the
activation record must be saved. The first optional function enables interrupts and yields a
“return from interrupt” instruction. The second optional function increments the cwp
(changing register windows) for returning from a normal call.

Figure 3.10: Fast Shuffle logic. When a call or jump is fetched from memory, the next instruction
is prefetched based on the external address latch instead of the PC.

The Smalltalk-80 language requires local variables to be initialized to nil, so the last
optional function for SOAR’s return instruction prepares registers 8 through 13 for a future
call by writing nil into them. Instead of commencing each subroutine with an instruction
sequence to write nil into each register that will contain a local variable, SOAR exploits
VLSI circuitry to initialize the registers in parallel. Although it would be more straightforward
for the call instruction to perform this initialization, this would slow down the call.
Instead, we have placed this functionality in the return instruction. Since the return instruction
must wait an extra cycle to fetch its target instruction, the “nilling” does not slow the
instruction down. This feature eliminates the extra time required to initialize the registers
after every call. Ironically, Smalltalk-80 subroutines use so few temporary variables (less
than one on the average) that this feature has little favorable impact. The system would
only run 4.3% slower and use 1% more memory without it.
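
The factored return instruction can be modelled as one compulsory action plus three independently selectable options. The sketch below is illustrative only: the bit assignments, the dictionary-based machine state, and the flat register list are assumptions, since the text says only that the three low-order opcode bits select the options.

```python
ENABLE_INTERRUPTS = 0b001  # hypothetical bit assignment
CHANGE_WINDOW = 0b010      # hypothetical bit assignment
NIL_REGISTERS = 0b100      # hypothetical bit assignment

NIL = "nil"


def execute_return(state, option_bits, return_address):
    """Apply the compulsory transfer of control plus any selected options."""
    state["pc"] = return_address            # compulsory: transfer of control
    if option_bits & ENABLE_INTERRUPTS:
        state["interrupts_enabled"] = True  # "return from interrupt"
    if option_bits & CHANGE_WINDOW:
        state["cwp"] += 1                   # return from a normal call
    if option_bits & NIL_REGISTERS:
        for r in range(8, 14):              # registers 8 through 13
            state["regs"][r] = NIL
    return state
```

With no option bits set the instruction behaves as an indirect jump; a normal subroutine return in this model would set both `CHANGE_WINDOW` and `NIL_REGISTERS`.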
3.5. Object-Oriented Storage Management

Smalltalk-80 data structures are called objects. SOAR objects average 14 words in
length and live for about 500 instructions. Smalltalk-80 objects are smaller and more volatile
than data structures in most other exploratory programming environments. Smalltalk-80
systems face three challenges in managing storage for objects:

• Automatic storage reclamation: On average, 12 words of data are freed and must be
reclaimed per 100 Smalltalk-80 virtual machine bytecodes executed.

• Virtual memory: All objects must be in the same address space.

• Object-relative addressing: Although offsets into objects are known at compile-time,
base addresses are not. Code must be compiled to address fields relative to dynamically
determined base addresses.


3.5.1. Automatic Storage Reclamation


SOAR supports Generation Scavenging to reclaim storage efficiently without requiring
costly indirection or reference counting (see Section 5.8). This algorithm is based on the
observation that most objects either die young or live forever. Thus, objects are placed into
two generations and only new objects are reclaimed. A better method of storage reclamation
has a strong impact on performance; most other algorithms would squander 10% to 15% of
SOAR’s time on automatic storage reclamation instead of Generation Scavenging’s 3%
(see Chapter 5). Hence, without Generation Scavenging SOAR would take 4% to 15% more
cycles to run the benchmarks.

Traditional software and microcode implementations of object-oriented systems rely
on an object address table (Figure 3.11). Each field of an object contains an index into this
table, and the table entry contains the address of each object. The level of indirection supplied
by the table provides support for compaction. As explained in Chapter 5, Generation
Scavenging provides compaction for free, permitting SOAR to function without an object
table (Figure 3.12). Without this algorithm, the extra work to follow the indirect pointers
through the object table would slow SOAR down by 20% (Section 5.9.4).
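
The trade-off between the two addressing schemes can be seen in a toy model. The names here (`memory`, `object_table`, `fetch_indirect`, `fetch_direct`, `move_object`) are hypothetical and the dictionaries merely stand in for memory; the sketch shows why a table makes moving objects easy and why it costs an extra reference on every access.

```python
memory = {}        # address -> list of fields
object_table = {}  # table index -> address (the level of indirection)


def fetch_indirect(oop, offset):
    address = object_table[oop]      # extra memory reference on every access
    return memory[address][offset]


def fetch_direct(pointer, offset):
    return memory[pointer][offset]   # the pointer is already the address


def move_object(oop, new_address):
    # With a table, compaction updates one entry; all table indexes stay valid.
    old_address = object_table[oop]
    memory[new_address] = memory.pop(old_address)
    object_table[oop] = new_address


memory[1000] = ["a", "b"]
object_table[7] = 1000
```

After `move_object(7, 2000)`, `fetch_indirect(7, 1)` still returns the same field, whereas any direct pointer holding 1000 would have to be found and updated by whatever moved the object.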


Generation Scavenging requires that a list be updated whenever a pointer to a new
object is stored in an old object. When designing SOAR, we thought that stores would be
frequent enough to warrant hardware support for this check. Thus SOAR tags each pointer
with the generation of the object that it points to. While computing the memory address, the
store instruction compares the generation tag of the data being stored with the generation tag
of the memory address (Figure 3.13). For 96% of the stores, list update is unnecessary and
the store completes without trapping (Table A.52). Once again we rely on tags to confirm
the normal case and trap in the unusual case. Surprisingly, tagged stores are so infrequent
that hardware support saves only 1% of the time and 3% of memory over an explicit check
(Tables A.49 and A.51). This feature does not seem to be worth the effort.

Figure 3.11: Indirect addressing. In traditional Smalltalk-80 systems, each pointer is really
a table index. The table entry contains the target’s reference count and memory address.
This indirection required previous Smalltalk-80 systems to dedicate base registers to frequently
accessed objects. The overhead to update these registers slowed each procedure
call and return.

Figure 3.12: Direct addressing. A SOAR pointer contains the virtual address of the target
object. This is the fastest way to follow pointers.
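
The check that the store instruction performs can be written out explicitly in software. This is a minimal sketch of the explicit-check alternative mentioned above, not SOAR's trap handler: the `Obj` class, the `OLD`/`NEW` constants, and the `remembered` list are hypothetical names, and a generation field on each object stands in for the tag carried in each pointer.

```python
OLD, NEW = 0, 1  # hypothetical generation tags


class Obj:
    def __init__(self, generation, size):
        self.generation = generation
        self.fields = [None] * size


remembered = []  # old objects that may contain pointers to new objects


def store(target, index, value):
    target.fields[index] = value
    # The uncommon case: a pointer to a new object is stored into an old object.
    if (isinstance(value, Obj) and value.generation == NEW
            and target.generation == OLD and target not in remembered):
        remembered.append(target)


old, young = Obj(OLD, 2), Obj(NEW, 1)
store(young, 0, old)   # new object pointing to old: nothing to record
store(old, 0, 5)       # non-pointer data: nothing to record
store(old, 1, young)   # old object pointing to new: recorded
store(old, 0, young)   # already recorded: list is not extended
```

Only the last kind of store touches the list, which is why the common case can complete without any extra work.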



Figure 3.13: Generation tag checking in parallel with a store operation. The first check
(1111) is for contexts and is explained in Section 3.3.2.



































3.5.2. Activation Records as Objects

Smalltalk-80 activation records pose a special problem. Since each call needs a new
activation record, they must be easy to create. Because local variables reside in them, at
least the current activation record must be easy to access. For these reasons,
high-performance systems for other languages allocate activation records on a stack, and
keep the active activation record in registers. The problem for Smalltalk-80 systems arises
because the language specifies that the format and lifetime of an activation record shall be
the same as any other object. In other words, a Smalltalk-80 activation record must be
stored in memory with a standard object header. Worse, an activation record cannot be deallocated
until the last reference to it is destroyed, even after control returns from it.

SOAR caches activation records in an on-chip register file for speed, backed with an
overflow stack in memory. Pointers to activation records are rare, so SOAR’s hardware
merely detects these and causes a trap at the appropriate time. The first trap occurs when a
reference to an activation record is created. Pointers to activation records have all the tag
bits set. When such a word is stored into memory, the tag check causes a trap. At the time
of the trap, the high order bit of the activation record’s return address is set. Setting this bit
indicates that the activation record may outlive its parent. Since these records are normally
allocated and freed last-in-first-out (LIFO), we label such anomalously long-lived activation
records as non-LIFO. The return instruction then traps if the return address has the high
order bit set; this lets software save this activation record in the heap.

What if a program references an activation record while it is still on the stack? First,
SOAR leaves small gaps between activation records when they are stored in main memory.
These gaps are initialized with object headers to permit the stored activation records to
behave as objects. Second, SOAR's hardware provides pointer-to-register addressing. Each
load and store checks if the target address resides in the on-chip register file. If so, the chip
substitutes a register access for a memory access. This mechanism makes it possible to
access on-chip activation records as if they were in memory.
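
Pointer-to-register addressing amounts to an address-range check on every load and store. The following sketch shows the idea under simplifying assumptions: the `Machine` class is hypothetical, addresses are word addresses, and the on-chip registers are taken to shadow one contiguous range starting at `window_base`.

```python
class Machine:
    def __init__(self, window_base, register_count):
        # Addresses in [window_base, window_base + register_count) are assumed
        # to name words currently held in the on-chip register file.
        self.window_base = window_base
        self.registers = [0] * register_count
        self.memory = {}

    def _register_index(self, address):
        offset = address - self.window_base
        return offset if 0 <= offset < len(self.registers) else None

    def load(self, address):
        index = self._register_index(address)
        if index is not None:
            return self.registers[index]   # register access substituted
        return self.memory.get(address, 0)

    def store(self, address, value):
        index = self._register_index(address)
        if index is not None:
            self.registers[index] = value  # register access substituted
        else:
            self.memory[address] = value
```

A program holding a pointer into a cached activation record therefore reads and writes the register copy without knowing that the record is not in memory.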

Since designing SOAR, we have come up with a software solution to the
pointer-to-register problem. This scheme eliminates the comparator and complicated control
logic, incurring only a 3% performance penalty (Table A.53). The key idea is to generate
illegal addresses for the unpredictable but uncommon activation record references, and to
guarantee that the common and predictably referenced activation records reside in memory
when needed (Section A.5.3).

3.5.3. Virtual Memory

The SOAR system will include disk storage and thus supports virtual memory. Section 5.4
explains our choice of demand paging over segmentation. SOAR therefore includes
a pin to request a page fault interrupt. The uniform size and lack of side-effects of SOAR’s
instructions simplify page fault recovery.

3.6. Implementation 

In this section, we give a brief description of SOAR’s implementation and microarchitecture.
This is covered in more detail in Pendleton's dissertation [Pen85b]. The casual
reader may want to skip this section; those interested in details may want to read on and
learn about the data path required for SOAR’s instruction set. Although simpler than many
other computers, SOAR’s implementation is substantially more complex than its predecessor,
RISC II.

3.6.1. Special Registers 

SOAR has eight special-purpose registers that simplify the instruction set and help
with interrupt handling (Tables 3.6 and 3.7). For instance, a register that always contains
zero permits the assembler to synthesize moves with add instructions. Making the program
counter available as a register provides relative addressing without adding another addressing
mode. However, supporting unrestricted use of these registers would complicate SOAR.
Three restrictions apply to these registers:

• A result written to a special register does not take effect until the end of the next 
instruction. The SOAR microengine cannot forward special registers. 

• A special register cannot appear as the destination of a load instruction. 

• A special register cannot appear in the SOURCE2 field of an instruction. 




Table 3.6: SOAR special registers.

Name              Symbol  Reg.  Bits   Contents                    Notes
zero              rzero   r16   31:0   Always = 0.                 For synthesizing instructions.

program counter   pc      r17   27:0   address of next             For instruction fetching,
                                       instruction                 PC-relative addressing, and case
                                                                   statement indirect jump (ret).
                                                                   Should not be modified directly,
                                                                   but only with jump, call, or
                                                                   ret[inw].

Shadow A          sha     r19   31:0   copy of A input to ALU      The shadow registers track
                                       or shifter                  instructions executed when
                                                                   interrupts are enabled and freeze
Shadow B          shb     r18   31:0   copy of B input to ALU      when interrupts are disabled.
                                       or shifter                  Thus, a trap handler can save
                                                                   time by reading operands from the
                                                                   shadow registers instead of
                                                                   decoding the offending
                                                                   instruction.

Trap Base         tb      r21   31:10  base address of the
                                       interrupt and trap
                                       vector area

Saved Window      swp     r20   27:4   memory address of object    For pointer-to-register logic,
Pointer                                header of the most          window-overflow and -underflow
                                       recently saved register     trap logic, and computing address
                                       window                      of current activation record.

Current Window    cwp     r22   6:4    index of on-chip register   Cwp controls local register
Pointer                                set serving as high window  decoders.

Processor Status  psw     r23   15:0   see below
Word
















Table 3.7: Processor Status Word fields.

Name                 Bits   Contents                                Notes
shadow destination          destination register field (bits        For trap handlers.
                            22:18) of last instruction executed
                            with interrupts enabled

software interrupt          When this bit is on and a call or       For process switching.
enable                      jump is executed with bit 29 on,
                            SOAR takes a software trap.

interrupt enable            Enables I/O interrupts and shadow       Disabled in interrupt handlers.
                            registers.

                                                                    Unused.

shadow opcode        15:8   opcode field (bits 30:23) of last       For trap handlers and trap vector
                            instruction executed with interrupts    logic. CAVEAT: SOAR does not
                            enabled                                 support nested traps. Traps taken
                                                                    when interrupts are disabled will
                                                                    not vector to proper opcode.


3.6.2. The SOAR Datapath


The SOAR datapath includes a register file, ALU (and byte shifter), the program 
counter, memory address register, and saved window pointer. When reading, the busses are 
first precharged, then two separate registers may be read onto the busses. For writing, a sin¬ 
gle register is addressed, and the data are driven differentially on both busses (Figure 3.14). 


3.6.3. Pipelining in SOAR 

The cycle time of SOAR has been matched to memory cycle time. Each instruction is
one word long and most can execute in one cycle. While one instruction executes, the next
is prefetched from memory (Figure 3.15). As described above, jumps and calls require no
address computation and therefore cause no delay in the pipeline. Conditional branches are
synthesized with a skip and an unconditional jump. This takes two cycles, which is the same
as a conditional branch would require.

Figure 3.14: The SOAR datapath. “sha” and “shb” are shadow registers A and B, “byte
ins/ext” is the byte insertion and extraction logic, “dst” is the destination latch, and
“MAL” is the memory address latch.

Figure 3.15: Pipelining in SOAR. Although an instruction takes three cycles, SOAR can
execute one instruction per cycle. Each cycle in turn consists of three phases.



























The anatomy of SOAR's cycle is determined by the fact that the datapath allows two
simultaneous precharged reads or one write to the register file. Each cycle is divided into
three nonoverlapping phases. In phase one, SOAR decodes the instruction and precharges
the busses. In phase two, the source registers are read onto the busses. In phase three, the
ALU combines the two operands. Simultaneously, the result from the previous instruction is
stored back into its destination register. Thus, the result of instruction i is not actually stored
into its destination register until the end of instruction i+1. Forwarding logic hides this
delay; if instruction i+1 attempts to read the destination register of instruction i, the desired
value is forwarded from a latch at the output of the ALU. This has a significant effect on
performance; if instead of forwarding, SOAR stalled the pipeline for a cycle, the benchmarks
would run 15% slower (Table A.54).
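
The cost of stalling instead of forwarding can be estimated by counting how often an instruction reads the register written by its immediate predecessor. The sketch below is a deliberately simple model with hypothetical names: a program is a list of (destination, sources) pairs, every instruction takes one cycle, and a one-cycle stall is charged for each such dependence when forwarding is turned off.

```python
def count_cycles(program, forwarding=True):
    """program is a list of (dest, sources) tuples, one per instruction."""
    cycles = 0
    previous_dest = None
    for dest, sources in program:
        cycles += 1
        if not forwarding and previous_dest is not None and previous_dest in sources:
            cycles += 1  # wait for the previous result to reach the register file
        previous_dest = dest
    return cycles


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r2")),  # reads r1 immediately after it is computed
    ("r5", ("r6", "r7")),
]
with_forwarding = count_cycles(program)
with_stalls = count_cycles(program, forwarding=False)
```

For this three-instruction example the stalled pipeline needs one extra cycle; the 15% figure quoted above reflects how common such back-to-back dependences are in the benchmarks.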

Pendleton has proposed a rearrangement of the pipeline that would shorten SOAR’s
cycle time by 25% [Pen85b]. However, the return instruction would be one cycle longer, for
a total of three cycles per return instruction. What would be the net effect? On the average,
SOAR performs 5.4 returns per 100 cycles (Table A.47). Thus, the effect of lengthening the
return instruction would be to execute 5.4% more cycles. Since the new cycle time would
be 25% faster, the new time to run the benchmarks would be 1.054 × 75% ≈ 79% of the old time.
(See Section 4.1 for a description of the benchmarks.) Rearranging SOAR's pipeline would
substantially reduce execution time.
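
The estimate above can be reproduced directly. The variable names are illustrative; the inputs are the figures quoted in the text.

```python
returns_per_100_cycles = 5.4  # Table A.47, as quoted in the text
extra_cycles_per_return = 1   # the rearranged return takes one more cycle
cycle_time_ratio = 0.75       # new cycle time relative to the old one

# More cycles are executed, but each cycle is shorter.
cycle_count_ratio = 1 + returns_per_100_cycles * extra_cycles_per_return / 100
run_time_ratio = cycle_count_ratio * cycle_time_ratio
```

Here `cycle_count_ratio` is 1.054 and `run_time_ratio` is about 0.79, so the benchmarks would take roughly 79% of their former time.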

3.6.4. Implementation Statistics

Table 3.8 contains some preliminary data for the NMOS SOAR chip, taken from
[Pen85b]. These chips were fabricated by MOSIS [MOSIS] and performed faster than the
simulators predicted, except for the unforeseen delay for jumps and calls described in Section
3.4.3. The MOSIS NMOS SOAR chips can execute an instruction every 400 ns, which
must be derated to 510 ns for the jump and call delay. Pendleton has perfected the host
board for SOAR, and has successfully run the entire diagnostic suite on the SOAR chips.
The best SOAR chip tested to date functioned perfectly with the exception of a faulty bit in
one register.

Table 3.8: NMOS SOAR characteristics.

line width                      4 µ
size (w/ scribe lines)
    width                       10.7 mm
    height                      8.0 mm
power dissipation               ~3 watts
supply voltage                  3 volts
transistors                     35,700
clocks
    φ1                          90 ns      underlap  <10 ns
    φ2                          90 ns      underlap  <23 ns
    φ3                          145 ns     underlap  40 ns
processor cycle time            <400 ns
fast shuffle settling time      110 ns
minimum system cycle time       510 ns
actual system cycle time        800 ns
pads                            82


3.7. Summary 

In designing SOAR, we have attempted to find a few good ideas to supplement a basic
RISC for Smalltalk. These are listed in Table 3.9. As a result of including all these features,
SOAR is considerably more complicated than RISC II. The next chapter evaluates our
architecture, and identifies its successes and failures.
Table 3.9: SOAR Architectural Ideas.

Idea                                                   Section  From
31-bit arithmetic (with tag & overflow checking)       2
a tagged/untagged mode bit in each instruction         2
conditional skips                                      2        PDP-8
tagged immediate values                                2
compilation to low level instruction set               3        RISC II
uniform length instructions                            3        RISC II
word-addressing w/ byte-insert and -extract            3        MIPS, PDP-10
instructions tagged as integers                        3
vectored, prioritized interrupts and traps             3
shadow registers                                       3
in-line call target cache                              4        Xerox ST-68K
software trap on jumps and traps                       4
one-cycle calls and jumps (fast shuffle)               4
factored return instruction                            4
parallel register initialization on return             4
load- and store-multiple                               4        IBM-360
multiple overlapping register windows on chip          4        RISC II
noncontiguous load- and store-multiple                 4
generation scavenging                                  5
trapping stores of new pointers into old objects       5        BS
trapping stores of activation record pointers          5        BS
trapping returns from referenced activation records    5
pointers to registers                                  5
paged virtual memory                                   5        Atlas, Sun
direct object addressing                               5        BS
special registers                                      6        RISC II
pipelined data path with forwarding                    6        RISC II
offline reorganization                                          BS
tag checking of addresses for load & store
hard-wired instructions                                         RISC II



















Chapter 4 


Performance Evaluation of the SOAR Architecture 


4.1. Introduction 

Can a reduced instruction set computer make Smalltalk-80 practical? In this section
we evaluate SOAR’s overall performance, place it in context with other Smalltalk-80 systems,
and examine features in the architecture to see which pull their weight and which are
just a waste of effort. Toward this end, we have analyzed running times and instruction
mixes of instruction-level simulations of Smalltalk-80 benchmarks (Figure 4.1).

Figure 4.1: Steps involved in a SOAR simulation. First, rot removes the object table from
the Xerox Smalltalk-80 image. We then use BS to make any modifications necessary in the
image (e.g., to eliminate some becomes). Newb2s produces a Smalltalk image for SOAR by
converting the BS objects to SOAR format, and running Hilfinger’s Slapdash compiler,
which translates the bytecoded programs to SOAR instructions. We have also coded the
Smalltalk primitive operations and storage management software in SOAR assembly
language. After this is assembled, it is fed to Daedalus, our SOAR simulator, along with the
Smalltalk image. The initials below each system indicate its author: ads is Dain Samples,
phn is Paul Hilfinger, and dmu is David Ungar.


























We have instrumented the SOAR simulator to record two types of data: frequencies
and profiles. Obtaining data from the simulator makes it possible to measure execution
without altering the program being measured. The simulator counts the number of times it
executes each instruction, the number of each type of trap taken, and other events. The
simulator also samples the program counter every hundred instructions. To gather the data,
we ran a benchmark once, reset the simulator’s counters, enabled profiling, ran the
benchmark for a second iteration, and then dumped the raw data to files. (Appendix B contains
our raw frequency data.) Unix™ utilities (awk and sed) analyze the data and report the usage
and value of particular features. (Appendix A contains these results.)
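
The measurement procedure just described can be sketched in a few lines. This is an illustrative Python sketch, not the Daedalus simulator: the class, method, and variable names are hypothetical, and the "program" is just a list of (program counter, opcode) pairs. It shows the two kinds of data (frequency counts and program counter samples taken every hundred instructions) and the warm-up iteration followed by a counter reset.

```python
from collections import Counter

class InstrumentedSimulator:
    """Toy stand-in for an instruction-level simulator (all names hypothetical)."""

    def __init__(self, sample_interval=100):
        self.sample_interval = sample_interval
        self.profiling = False
        self.reset_counters()

    def reset_counters(self):
        self.opcode_counts = Counter()   # frequency data
        self.pc_samples = Counter()      # profile data
        self.executed = 0

    def step(self, pc, opcode):
        """Record one executed instruction; the measured program is not altered."""
        self.executed += 1
        self.opcode_counts[opcode] += 1
        if self.profiling and self.executed % self.sample_interval == 0:
            self.pc_samples[pc] += 1

def run_benchmark(sim, trace):
    for pc, opcode in trace:
        sim.step(pc, opcode)

def measure(sim, trace):
    run_benchmark(sim, trace)   # first iteration: warm-up, results discarded
    sim.reset_counters()
    sim.profiling = True
    run_benchmark(sim, trace)   # second iteration: the one that is measured
    return sim.opcode_counts, sim.pc_samples
```

Because the counters live in the simulator rather than in the simulated program, the benchmark itself runs exactly the same instructions whether or not it is being measured.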

Xerox has defined an official set of benchmarks for the Smalltalk-80 system [McC83].
Some are called “micro-benchmarks” because they test particular small operations like
integer addition. The rest are called “macro-benchmarks” because they test large
operations like compilation, display, and exploring system organization. These are typical
high-level activities for Smalltalk-80 programmers. We selected five macro-benchmarks for
our measurements. When writing Smalltalk-80 programs, we spend more time waiting for
the compiler than for anything else. For this reason, we started with the testCompiler
benchmark. The other four benchmarks were chosen because they did not output to the
display and did not require substantial modifications for SOAR. Although fast display
output is vital for Smalltalk, it has been addressed by many others, and is outside the scope of
this dissertation. The following descriptions of the benchmarks we chose quote from
[McC83]:

testClassOrganizer 

“This benchmark measures the speed of conversion between the textual and the
structural representations of a class organization. The example chosen is class Benchmark
because its organization contains many categories.”






testPrintDefinition


“This benchmark measures how quickly a class definition, as it appears in the system 
browser, can be generated. The example chosen is an instance of class Compiler 
because it has a moderate number of instance variables.” 

testPrintHierarchy 

“This benchmark times the printing of a portion of the Smalltalk-80 class hierarchy. 
The example chosen is class InstructionStream because it has several subclasses.”

testCompiler 

"This benchmark measures the speed of the compiler on a slightly longer than normal 
method, one containing 87 tokens and compiling into 73 bytecodes.” 

testDecompiler 

“This benchmark measures the speed of the Decompiler by decompiling all the 
methods in class InputSensor.” 

In addition, we used a few micro-benchmarks to evaluate an upper bound for the
performance impact of specific features:

testPopStoreInstVar

“This benchmark measures how quickly a value can be popped off the stack and
stored in an instance variable of the receiver. Because this value is the SmallInteger
1, there is little reference counting overhead on the push or store. 50% of the bytes in
the block are 16r60,* a pop of the top of the stack into the receiver's first instance
variable.”

test3plus4

“This benchmark measures the speed of SmallInteger addition. Because all values
are SmallIntegers, there is little reference-counting overhead. 25% of the bytes in the
block are 16rB0,* a quick send of the message +.”



















testActivationReturn

“This very important benchmark uses a call on a doubly-recursive method to measure
the speed of method activation and return. There is little reference-counting overhead
associated with knowing when to end the recursion, but there may be a great deal in
managing the Contexts that represent the activations. About 12.5% of the bytes
executed during this benchmark are 16rE0,* a send of the method’s first literal (in this
case, the Symbol recur:), and about 12.5% are returns, split evenly between 16r78,* a
quick return of the receiver, and 16r7C,* a return of the value on the top of the stack.”


How representative are these five macro-benchmarks? Xerox rates the performance of
Smalltalk-80 systems relative to the Dorado by taking the mean of the 13 macro-benchmarks
plus the text scanning and BitBlt micro-benchmarks [Bay84]. Table 4.1 below compares the
compiler benchmark, the median of the five macro-benchmarks used here, and the Xerox
performance rating for four other Smalltalk-80 systems. The data suggest that the
benchmarks we used slightly underestimate overall performance.


We have not considered the interaction between the availability of hardware features 
and the sophistication of the optimizations performed by the compiler. The only compiler 


Table 4.1: Comparison of Performance Metrics.

[Table values did not survive in the source. Surviving column headers: median of
classOrganizer, decompiler, printDefinition, printHierarchy; Xerox Performance Rating.
Surviving row labels: Berkeley Smalltalk on Sun 2 [Bay84]; Tektronix 4404 [Bay84];
Xerox PS on Sun 2 [Bay85]; Xerox PS on Sun 3 [Bay85]; Xerox Dorado;
SOAR (simulated @ 400 ns).]



* The 16r prefix denotes a hexadecimal number. For example, 16r7C is 124.

























changes we have taken into account are those required to simulate the missing hardware.
For example, to compute the overhead of software type checking, we counted the number of
times that hardware type checking was performed by code from the current compiler and
multiplied that count by the cost of a software check. It is possible that a Smalltalk-80
compiler for a machine without hardware support for type checking would reduce the overhead
with a data-flow analysis to eliminate redundant type checking. However, such techniques
are not used in existing Smalltalk-80 compilers, which must cope with dynamic type
binding. The performance measurements in this dissertation hold only for Smalltalk-80 systems
with state-of-the-art compiler technology.
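
The type-checking estimate described above amounts to one multiplication and one division. The following Python sketch illustrates the calculation only; the function name is hypothetical and the numbers in the example are made up for illustration (they are not measurements from this work).

```python
def software_check_slowdown(total_cycles, hardware_checks, cycles_per_software_check):
    """Fractional slowdown if every hardware tag check became a software check.

    Assumes the compiler is otherwise unchanged, so no redundant checks
    are eliminated by data-flow analysis.
    """
    extra_cycles = hardware_checks * cycles_per_software_check
    return extra_cycles / total_cycles

# Illustrative numbers only: 50,000 checks at 4 cycles each in a 1,000,000 cycle run.
slowdown = software_check_slowdown(
    total_cycles=1_000_000,
    hardware_checks=50_000,
    cycles_per_software_check=4,
)
print(f"estimated slowdown: {slowdown:.0%}")
```

An optimizing compiler that removed redundant checks would lower `hardware_checks`, which is why the estimate is an upper bound for the current compiler only.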

4.2. Overall Performance: SOAR vs Dorado

Can SOAR provide acceptable performance with a single-chip processor? The Dorado
is the only Smalltalk-80 system that everyone agrees is fast enough. If SOAR can run as fast
as a Dorado, it will certainly provide a usable Smalltalk-80 system. (The Xerox MC68020
Smalltalk-80 system is also approaching the Dorado’s performance.) Table 4.2 compares
SOAR’s performance to the Dorado on five macro-benchmarks and the procedure call
micro-benchmark. The Dorado numbers were obtained from Xerox’s Smalltalk-80
Newsletter [Bay84]. The SOAR numbers were obtained by simulating the benchmarks for
two iterations, taking the number of cycles for the second iteration,* and multiplying by 400
ns,† our measured cycle time for the 4μ chips. These data show that a 400 ns SOAR will
perform well enough to please everyone who already uses Smalltalk-80.
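
The conversion from simulated cycles to seconds is a single multiplication. A minimal Python sketch, assuming the 400 ns cycle time quoted above; the function name and the example cycle count are hypothetical.

```python
CYCLE_TIME_NS = 400  # measured cycle time quoted in the text

def soar_seconds(cycles_per_iteration, cycle_time_ns=CYCLE_TIME_NS):
    """Predicted running time of one benchmark iteration, in seconds."""
    return cycles_per_iteration * cycle_time_ns * 1e-9

# Hypothetical example: a benchmark that takes half a million cycles per iteration.
example_time = soar_seconds(500_000)
```

Any change in cycle time scales every predicted time by the same factor, which is why the cycle counts and the cycle time can be studied separately.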


* We consider the second iteration to be more representative. Had we used the numbers for the first iteration, initial [...]
lookups would have slowed the benchmarks down by up to 10%.

† Implementation problems with the fast shuffle (Section 3.4.3) will prevent full speed operation unless the memory
cycle time can be reduced by 100 ns over the chip cycle time. Alternatively, the fast shuffle signal can be ignored, and the chip
could run as a delayed branch architecture [Pat85a].
[Table 4.2 did not survive in the source. Surviving column headers: Benchmark; Cycles/iter;
# iter; SOAR time (secs); time (secs). One surviving value: 483694.]
4.3. Relative Performance of SOAR

In the previous section, we showed that SOAR will run as fast as a Dorado. How does
this compare to other Smalltalk-80 systems? Table 4.3 compares the performance of the
compiler benchmark on several Smalltalk-80 systems. Both SOAR and the 68010 are
NMOS microprocessors, although the 68010 has almost twice as many transistors as SOAR:
68,000 vs. 35,700. Since Deutsch and Schiffman’s ST68K is also a compiled
implementation [DeS84], it serves as the fairest architectural comparison to SOAR. Unlike the ST68K
code translator, the current SOAR compiler generates unnecessary instructions (see Table
2.11); a better compiler would improve SOAR's performance. By creating a custom
processor, we have more than doubled performance, while halving the number of transistors.
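
The two ratios behind that last sentence can be checked directly. This Python sketch uses the transistor counts quoted above and the Table 4.3 compiler-benchmark speeds of 40% (the compiled system on the 68010) and 103% (SOAR), both relative to the Dorado; the variable names are hypothetical.

```python
transistors = {"68010": 68_000, "SOAR": 35_700}
speed_vs_dorado = {"compiled 68010": 0.40, "SOAR": 1.03}

# SOAR uses roughly half the transistors of the 68010.
transistor_ratio = transistors["SOAR"] / transistors["68010"]

# SOAR runs the compiler benchmark more than twice as fast.
speedup = speed_vs_dorado["SOAR"] / speed_vs_dorado["compiled 68010"]

print(f"transistor ratio: {transistor_ratio:.2f}, speedup: {speedup:.2f}x")
```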


Table 4.3: Compiler benchmark speed for various Smalltalk-80 systems,
relative to Dorado; larger is faster.

                        host        cycle       execution
                        processor   time (ns)   model         speed

BS         UCB          68010       400         interpreter    11%
Tek 4404   Tektronix    68010       400         interpreter    25%
PS         Xerox        68010       400         compiler       40%
PS         Xerox        68020       180*        compiler       80%
Dorado     Xerox        Dorado       70         microcode     100%
SOAR       UCB          SOAR        400         compiler      103%


















































4.4. Evaluating Individual Features 


Although SOAR's design was driven by empirical results, our experimental subject at
that time was a bytecode interpreter, not a SOAR simulator. Now that we have a compiler,
simulator, and run-time support software for SOAR, we have been able to perform an accurate
assessment of its features (Table 4.4). (Appendix A contains detailed derivations of the data.)
Each row gives the feature’s name, the minimum, average, and maximum effect it would
have on speed were it omitted or added, and the effect it would have on total memory size.
For example, the tagged integer support is described in Section 3.2. If left out of SOAR, and
if the compiler were unchanged, the macro-benchmarks we simulated would take from 14%
to 47% longer to run, with an average time penalty of 26%. The SOAR Smalltalk-80 virtual
image would grow by 15% from its 1.5 MB. Remember that (except for rearranging the
pipeline) our performance figures count cycles and neglect the interaction between
architecture and cycle time. For a discussion of cycle time effects, see Pendleton’s dissertation
[Pen85b].

Table 4.4 above groups the features in the order that they were presented in the last
chapter. In Table 4.5, we have reordered them by average performance impact and added
Pendleton’s complexity results in order to identify winners and losers. The complexity index
combines the number of diagnostics, circuit blocks, and hand-drawn transistors required for
a feature. For example, the most complicated feature, multiple on-chip register windows,
has an index of 10.

The importance of register windows on SOAR stems from an important feature of the
Smalltalk-80 system, fast compilation. Like some other exploratory programming
environments, the Smalltalk-80 system achieves split-second compilation times by compiling each
procedure by itself; there are no macros, interprocedural analysis, nor static interprocedural
binding. Thus, the compiler runs fast because it has shed the burden of binding or
optimizing subroutine calls. This results in a high frequency of subroutine calls, which forces
[Table 4.4 did not survive in the source; only a caption fragment and one entry (4%) remain.]

Table 4.5: Features in order of performance impact.
Except for rearranged pipeline, excludes impact on cycle time.

                         slowdown      expansion     complexity
                         if omitted    if omitted

questionable
  parallel nilling       4.3%          1.3%          2.5
  trap instructions      3.9%          2.0%          1.7
  loadm/storem*          3.4%          2.0%          1.6

[The remaining entries did not survive in the source. Surviving row labels: compilation;
... immediates†; pointer-to-register*; vectored traps; rearranged pipeline [Pen85b];
load/store byte; multiply/divide; compare-and-branch; one cycle traps.]

... be considered. For example, once tagged integer instructions are eliminated, the penalty for eliminating two-tone instructions ...

† The introduction of Generation Scavenging allowed us to exploit direct pointers.

‡ Pendleton has discovered that SOAR's implementation of this feature lengthened the cycle time by ~25%. See Section ...










































hardware to shoulder the responsibility for efficient execution of calls. This explains why
register windows are so effective for SOAR. Although they add the most complexity of any
feature [Pen85b], SOAR would run 46% slower without them.

The data suggest that we could simplify SOAR without sacrificing much performance. 
If we removed all but the winning features, SOAR would only take 19% more time and 8% 
more memory. Adding Pendleton's pipeline rearrangement would then result in a simpler 
design with the same performance as the original. If we were to include more features, they 
might be trap instructions, loadm/storem, and vectored traps. Such a design would be 11% 
faster than SOAR, and use only 4% more memory. 
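
The claim that a winners-only design with Pendleton's pipeline rearrangement matches the original can be checked with a short calculation. This Python sketch uses the 19% time penalty quoted above and the 21% pipeline improvement reported in the conclusions of this chapter. Reading "21%" as a 1.21x gain in speed is an assumption of the sketch, and the variable names are hypothetical.

```python
full_soar_time = 1.00           # running time of full SOAR, normalized
winners_only_time = 1.19        # 19% more time with only the winning features
pipeline_speedup = 1.21         # assumption: "21%" read as a 1.21x speed gain

# Time of the simplified design once the pipeline is rearranged.
simplified_time = winners_only_time / pipeline_speedup

print(f"simplified design time relative to full SOAR: {simplified_time:.3f}")
```

Under this reading the simplified design comes out within about 2% of the original, consistent with "the same performance as the original".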

Four of the features in SOAR are mistakes: parallel nilling, pointer-to-register,
generation tag hardware, and shadow registers.* Although fully aware of it, we still fell into what
we now call the “architect's trap” at least four times:

• Each mistake was a clever idea;

• Each made a particular operation much faster;

• Each increased design and simulation time;

• Not one significantly improved overall performance.

Another way to appreciate the worthlessness of these four features is that load/store byte
instructions would save more cycles than these four put together.

We have put these results to use by calculating the performance of some variations on
SOAR and comparing them to some real systems (Table 4.6). Our predictions of SOAR’s
performance are based on simulated macro-benchmark times and do not include virtual
memory, operating system, and I/O overhead. However, all of the Smalltalk-80 systems we
know about tend to be compute-bound for program development. For a fair comparison, we

* Loadc and [...] neither help nor hinder. Calling them mistakes is too pejorative; we would rather think of them as idle ...


























assume a 400 ns cycle time for SOAR, RISC II, and MC68010.


By comparing the speeds of different systems, we can gain some insight into the
reasons for SOAR’s good performance:

• The speed ratio of full SOAR to RISC II, 1.6, is the same as the ratio of RISC II to the
Xerox 68010 system. This indicates that the reduced instruction set architecture
(including register windows) and the Smalltalk-specific hardware features contribute
equally to performance.

• Interestingly, the Deutsch-Schiffman 68010 compiled system is a bit better than the
estimate for SOAR with only the software ideas. Perhaps the optimizations in
Deutsch’s compiler account for the difference.

• Since the Tektronix system neither compiles nor scavenges, its software resembles a
stripped SOAR. Thus, the similar performance of the Tek system to stripped SOAR
suggests that the stripped SOAR hardware performs as well as the MC68010.

The simplicity and high performance of eliminating all but the winning features and
rearranging SOAR’s pipeline make this an appealing design.
4.5. Conclusions


SOAR’s hardware and software design represents an advance for object-oriented
experimental programming environments. SOAR has almost half of the transistors of the
68010, yet runs Smalltalk-80 2.5 times faster. Register windows, tagged integer
instructions, direct pointers, and generation scavenging account for most of the difference. These
four ideas represent SOAR’s most important contribution to EPE systems.

Our analysis of a feature's value was based on counting cycles. Barring the pipeline
rearrangement, we ignored the effect of adding a feature on the cycle time (see [Pen85b]).
In fact, some of the features we added to the machine must have perversely increased the
cycle time enough to offset the reduction in cycles, thereby slowing down the system. In
particular, the hardware support for automatic storage reclamation probably did not speed up
SOAR. Other examples of mistakes in SOAR are the inclusion of parallel register nilling,
logic to support pointers to registers, and shadow registers to aid trap handling. We observe
that the inclusion of interesting features that complicate the design but do not improve the
performance of representative programs is a trap that many architects fall prey to, including
us.*

There are four places to look for further performance gains: compiler technology
(outside the scope of this dissertation), implementation technology (see [Pen85b]), optimization
of the run-time support primitives (which consume about two thirds of SOAR’s time), and
better hardware or software algorithms to cache call target lookups (which consume 23% of
SOAR’s time). Of these, implementation technology (circuit design and VLSI processing
technology) has the most dramatic impact. Since we started this project, the standard
VLSI technology available to universities has improved from 4μ line widths to 3μ. This one
change should reduce our cycle time from 400 ns to 290 ns, as important a contribution as

* Pendleton has discovered that SOAR's implementation of the Fast Shuffle means a 25% penalty when the chip is used
with a 400 ns memory system (Section 3.4.3). This dwarfs the architectural benefit of an 11% reduction in the number of cycles.
In this case the culprit was our failure to simulate the memory system along with the chip.
register windows. Another example is Pendleton’s pipeline rearrangement, which could
improve performance by 21%. This is more than the combined effect of parallel nilling, trap
instructions, loadm/storem, pointer-to-register, vectored traps, and generation tag-checking
hardware.
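
The effect of the finer line width can be put on the same scale as the architectural features. This Python sketch assumes the cycle count is unchanged when the cycle time drops from 400 ns to 290 ns; the function name is hypothetical.

```python
def speed_gain(old_cycle_ns, new_cycle_ns):
    """Speed ratio from a shorter cycle, assuming the same number of cycles."""
    return old_cycle_ns / new_cycle_ns

technology_gain = speed_gain(400, 290)   # 4 micron to 3 micron line widths
print(f"speed gain from the process change: {technology_gain:.2f}x")
```

A gain of roughly 1.38x is of the same order as the benefit attributed to register windows, without any change to the architecture.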

A 70 ns ECL Dorado is the only existing machine that runs Smalltalk-80 fast enough
to satisfy everyone, and the 400 ns NMOS SOAR chips that have been fabricated should run
just as fast. Thus, SOAR will support the Smalltalk-80 system with excellent performance.











Chapter 5 


Non-Disruptive High Performance Storage Reclamation 


Throw back the little ones 
and pan fry the big ones; 
use tact, poise and reason 
and gently squeeze them. 

Steely Dan, 

“Throw Back the Little Ones” 
[BeF74] 


5.1. Introduction 


Early in the SOAR project, we realized that automatic storage reclamation could easily
become a bottleneck. We knew the overhead for allocation and freeing in Smalltalk-80
systems ranged from 10% to 15% [DeS84, UnP83], that some reclamation algorithms
introduced annoying pauses, that some required the programmer to explicitly free circular
structures of objects, and that most of the algorithms required microcode support. Since we
needed to attain good performance in a system without microcode, we have designed,
implemented, and measured Generation Scavenging, a new garbage collector that


• limits pause times to a fraction of a second,

• requires no hardware support,

• meshes well with virtual memory,

• reclaims circular structures, and

• uses only 3% of the CPU time in SOAR. This is less than a third of the time of
deferred reference counting, the next best algorithm.*


* Experience with SOAR has made us realize that some of the other algorithms that are usually microcoded need not be.
Although our original reason for searching for a new algorithm proved to be unfounded, we found something that enjoys solid
advantages in performance and the ability to reclaim circular structures.


























This section describes the challenge of providing automatic storage reclamation,
surveys some popular algorithms, and presents our solution. It concludes by evaluating the
performance of Generation Scavenging, based on running the Smalltalk-80 benchmarks
[McC83] on BS and simulating them on SOAR. An earlier and shorter version of this
chapter appeared in [Ung84].

5.2. The Relationship Between Virtual Memory and Storage Reclamation

The storage manager must ensure an ample supply of virtual addresses for new objects, 
and must maintain a working set of existing objects in physical memory. Traditionally, the 
functions have been separated into two parts as shown in Table 5.1 and Figure 5.1. 

Sometimes the distinction between virtual memory and automatic reclamation can lead
to inefficiency or redundant functionality. For example, some garbage collection (GC)
algorithms require that an object be in main memory when it is freed; this may cause extra
backing store operations. As another example, both compaction and virtual memory make room
for new objects by moving old ones. Thus storage reclamation algorithms and virtual
memory strategies must be designed to accommodate each other’s needs.


Table 5.1: Traditional decomposition of storage management. 

[Table body did not survive in the source. Surviving entries: a "name" column with rows
"virtual memory" and "auto reclamation".]

Figure 5.1: Virtual memory vs. automatic storage reclamation.

















5.3. Personal Computers Must Be Responsive


Personal computers differ from time-sharing systems. For example, with personal
computers there are no other users to blame for distracting pauses. Yet personal machines
have time available for periodic offline tasks, for even the most fanatic hackers sleep
occasionally. Personal computers promise consistently short response times, which are known to
boost productivity significantly [Tha81].


5.4. Virtual Memory for Advanced Personal Computers 

Computers with fast, random access secondary storage can exploit program locality to
manage main memory for the programmer. Advanced personal computer systems manage
memory in many small chunks, or objects. The Symbolics ZLISP, Cedar-Mesa,
Smalltalk-80, and Interlisp-D systems are examples. Table 5.2 summarizes segmentation and paging,
the two virtual memory techniques.


5.4.1. Segmentation 

A segmented virtual memory enjoys the flexibility of placing each object in physical
memory independently of the other objects. This packing efficiency can result in better use
of main memory and a reduction in time-consuming backing store operations. However,
segmentation's performance advantage disappears when main memory becomes more
plentiful [Sta82, So84]. Moreover, the variety and quantity of objects in advanced personal
computer systems pose tough challenges for a segmented virtual memory. In our
Smalltalk-80 memory image, for example, the length of an object can vary from 24 bytes
(points) to 128,000 bytes (bitmaps), with a mean of about 50. Suppose segmentation alone
is used. When an object is created or swapped in, a piece of main memory as large as the
object must be found to hold it. Thus, a few large bitmaps can crowd out many smaller but
more frequently referenced objects.

Table 5.2: Segmentation vs. Paging. [Table body did not survive in the source.]

When objects are small, it takes many of them to accomplish anything. Smalltalk-80 
systems already contain 32,000 to 64,000 objects, and this number is increasing. A seg¬ 
mented memory with this many segments requires either a prohibitively large or a 
content-addressable segment table. + This large number hampers address translation. 


5.4.2. Demand Paging

The simplicity of page table hardware and the opportunity to hide the address
translation time make paging attractive to hardware designers [Den70]. Paging, however, is not a
panacea for advanced personal computers. It can squander main memory by dispersing
frequently referenced small objects over many pages. Blau has shown that periodic offline
reorganization can prevent this disaster [Bla83d]. The daily idle time of a personal computer
can be used to repack objects onto pages.

Many objects in advanced personal computers live only a short time. The paging
literature contains little about strategies for such objects. Since their lifetimes are shorter
than the time to access backing store, these objects should never be paged out. By
segregating short-lived objects from permanent ones, Generation Scavenging permits them to be
locked in main memory. Table 5.3 summarizes the obstacles that advanced personal
computers pose for a paged virtual memory, and the solutions that SOAR has adopted. BS and
the DEC VAX/Smalltalk-80 system [BaS83] use paging.


5.5. Automatic Storage Reclamation for Advanced Personal Computers 

Advanced personal computers depend on efficient automatic storage reclamation. For
example, Berkeley Smalltalk allocates a new object every 80 instructions. This is consistent
with Foderaro’s results for a few voracious Lisp programs [FoF81]. Since the total size of
the system was in an equilibrium for these measurements, the reclamation rate must match
the allocation rate. The mean dynamic object size is 70 bytes long. Thus, seven bits must
be reclaimed for every instruction executed.
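
The seven-bit figure follows directly from the two measurements above. A short Python check, with a hypothetical function name:

```python
def bits_reclaimed_per_instruction(mean_object_bytes, instructions_per_allocation):
    """Bits that must be reclaimed per instruction when the heap is in equilibrium."""
    return mean_object_bytes * 8 / instructions_per_allocation

# 70-byte objects, one allocated every 80 instructions.
bits = bits_reclaimed_per_instruction(70, 80)
print(bits)  # 7.0
```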

Let’s examine several garbage collection algorithms and evaluate their suitability for
advanced personal computers. Where possible, we use performance figures from actual
implementations of these algorithms. The Xerox Dorado Smalltalk-80 system is closest to
an advanced personal computer; when we try to compare results we shall normalize to that
speed. For example, the bandwidth imposed on the BS storage allocator is

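$$\frac{70\ \text{bytes}}{1\ \text{object}} \times \frac{1\ \text{object}}{80\ \text{instructions}} \times \frac{9000\ \text{bytecodes}}{\text{second}} = 7875\ \frac{\text{bytes}}{\text{second}}$$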

If we scale this up to the speed of the Xerox Dorado system, the storage allocation rate
exceeds 100 KB/s.

Jon L. White was one of the first researchers to exploit the overlap between the
functions of virtual memory and garbage collection, and he proposed that address space
reclamation was obsolete in a virtual memory system [Whi80]. He pointed out that as long as
referenced objects were compacted into main memory, dead objects would be paged out to
backing store. This strategy may have adequate performance as far as CPU time and main
memory utilization, but it demands too much from the backing store in a Smalltalk-80
system. Even if a 100 MB backing store could keep up with the 100 KB/sec allocation
bandwidth, it would fill up in less than an hour.

$$\frac{100\ \text{MB} / \text{disk}}{100\ \text{KB trash} / \text{second}} \approx 20\ \text{minutes}.$$

This is unacceptable. 
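
Both bandwidth figures in this argument can be reproduced with a few lines of arithmetic. This Python sketch uses the numbers quoted in the text. It treats the "instructions" and "bytecodes" in the allocation-rate expression as the same unit and uses decimal megabytes and kilobytes; both are assumptions of the sketch, and the function names are hypothetical.

```python
def allocation_bandwidth(bytes_per_object, instructions_per_object, instructions_per_second):
    """Bytes allocated (and, in equilibrium, reclaimed) per second."""
    return bytes_per_object / instructions_per_object * instructions_per_second

def seconds_to_fill(capacity_bytes, bytes_per_second):
    """Time until a backing store fills up if garbage is never reclaimed."""
    return capacity_bytes / bytes_per_second

# Berkeley Smalltalk: 70-byte objects, one per 80 instructions, 9000 per second.
bs_rate = allocation_bandwidth(70, 80, 9000)

# A 100 MB disk absorbing 100 KB of garbage per second.
fill_minutes = seconds_to_fill(100e6, 100e3) / 60

print(f"BS allocation rate: {bs_rate:.0f} bytes/second")
print(f"disk fills in about {fill_minutes:.0f} minutes")
```

The disk lasts roughly a quarter of an hour, the same order as the figure quoted above and well under the hour mentioned in the text.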

There are many automatic storage reclamation algorithms [Coh81], but they can be 
divided into two families: those that maintain reference counts and those that traverse and 
mark live objects. In the next few sections, we examine several reclamation algorithms and 
discuss their suitability for advanced personal computers. 

5.6. Reclaiming Storage by Counting References 

Reference counting was invented in 1960 [Col60] and has undergone many
refinements [Knu73, Sta80]. The central idea is to maintain a count of the pointers that
reference each object. If an object’s reference count should fall to zero, the object is no
longer accessible and its space can be reclaimed (Figure 5.2).

5.6.1. Immediate Reference Counting 

Immediate reference counting adjusts reference counts on every store instruction and 
reclaims an object as soon as its count drops to zero. Both the Dorado Smalltalk-80 system 
[GoR83] and LOOM [KaK83, Sta82, Sta84] reclaim space with this algorithm. Compaction 
is handled separately and typically causes a pause of 1.3 seconds every 1 to 20 minutes on a 
Sun 68010 workstation. 

Counting references takes time. For each store, the old contents of the cell must be 
read so that its referent’s count can be decremented, and the new content's referent’s count 


















Figure 5.2: Standard reference counting. The standard reference counting algorithm
associates a reference count with each object. An object is reclaimed when the count goes to
zero. Object 3 is referenced only by itself, and is thus garbage. Since its count is nonzero,
it cannot be reclaimed by a reference counting algorithm.

must be increased. This consumes 15% of the CPU time [Deu83b, UnP83]. When an
object’s count diminishes to zero, the object must be scanned to decrement the counts of
everything it references. This recursive freeing consumes an additional 5% of execution
time [Deu82a, UnP83]. Thus, the total overhead for reference counting is about 20%. This
substantial overhead is acceptable for personal computers, but deferred reference counting
and Generation Scavenging (discussed below) use much less.

Reference counting cannot reclaim cycles of unreachable objects. Even though the
whole cycle is unreachable, each object in it has a non-zero count. Deutsch [Deu83b]
believes that this limitation has hurt programming style on the Xerox Smalltalk-80 system
(which employs reference counts), and Lieberman [LiH83] has also stated that circular
structures are becoming increasingly important for artificial intelligence applications. The
advantage of immediate reference counting is that it uses the least amount of memory for
temporary objects: about 13 KB when running the Smalltalk-80 macro benchmarks.
However, its inability to reclaim circular structures remains a serious drawback for advanced
personal computers.
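
The behavior described in this section, including the failure on cycles, can be seen in a small model. This is a minimal Python sketch of immediate reference counting, not the code of any system cited here; the class and function names are hypothetical. Each store reads the old contents of the cell, increments the new referent's count, and decrements the old one; a count that reaches zero triggers recursive freeing.

```python
class Obj:
    def __init__(self, name, nslots=1):
        self.name = name
        self.slots = [None] * nslots
        self.count = 0        # number of counted references to this object
        self.freed = False

def store(holder, index, new):
    """Counted store: adjust both referents' counts on every store."""
    old = holder.slots[index]          # the old contents must be read first
    if new is not None:
        new.count += 1
    holder.slots[index] = new
    if old is not None:
        release(old)

def release(obj):
    obj.count -= 1
    if obj.count == 0:
        free(obj)

def free(obj):
    """Recursive freeing: scan the dead object and release what it references."""
    obj.freed = True
    for i, ref in enumerate(obj.slots):
        obj.slots[i] = None
        if ref is not None:
            release(ref)

# A root object that is always considered live.
root = Obj("root", nslots=2)
root.count = 1

# A chain root -> a -> b is reclaimed as soon as root drops a.
a, b = Obj("a"), Obj("b")
store(root, 0, a)
store(a, 0, b)
store(root, 0, None)

# An object that refers to itself is never reclaimed, as in Figure 5.2.
c = Obj("c")
store(root, 1, c)
store(c, 0, c)
store(root, 1, None)
```

After the last store, `c` is unreachable but still has a count of one, so it is never freed.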












5.6.2. Deferred Reference Counting

The Deutsch-Bobrow deferred reference counting algorithm reduces the cost of
maintaining reference counts [DeB76]. Three contemporary personal computer programming
environments use this algorithm: Cedar Mesa, Interlisp-D (both on Dorados), and an
experimental Smalltalk-80 system which furnished the performance measurements quoted herein
[DeS84]. The Deutsch-Bobrow algorithm diminishes the time spent adjusting reference
counts by ignoring references from local variables (Figure 5.3). These uncounted references
preclude reclamation during program execution. To free dead objects, the system
periodically stops, and reconciles the counts with the uncounted references. On a typical personal
computer the algorithm requires 25 KB more space than immediate reference counting, and
averages 30 ms pauses every 500 ms.


Baden's measurements of a Smalltalk-80 system suggest that this method saves 90%
of the reference count manipulation needed for immediate reference counting [Bad82].
Deferred reference counting spends about 3% of the total CPU time manipulating reference
counts, 3% for periodic reconciliation, and 5% for recursive freeing. Thus, deferred
reference counting uses about half the time of simple reference counting.





Figure 5.3: Deferred reference counting. The deferred reference counting algorithm does
not count references to objects from the execution stack. A zero count does not ensure that
an object is reclaimable; it may still have references from the stack.
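
The idea in Figure 5.3 can be sketched as follows. This is an illustrative Python model, not the Deutsch-Bobrow implementation or any system measured here; the class and method names are hypothetical, and the use of a set of zero-count objects is an assumption of the sketch. Stores into objects are counted, references held on the stack are not, and a periodic reconciliation frees the zero-count objects that the stack does not reference.

```python
class Obj:
    def __init__(self, name, nslots=1):
        self.name = name
        self.slots = [None] * nslots
        self.count = 0        # counts references from other objects only
        self.freed = False

class DeferredRC:
    def __init__(self):
        self.stack = []           # uncounted references (local variables)
        self.zero_count = set()   # objects whose count is currently zero

    def new(self, name, nslots=1):
        obj = Obj(name, nslots)
        self.zero_count.add(obj)  # no counted references yet
        return obj

    def store(self, holder, index, new):
        """Counted store into an object; pushes onto self.stack are not counted."""
        old = holder.slots[index]
        if new is not None:
            new.count += 1
            self.zero_count.discard(new)
        holder.slots[index] = new
        if old is not None:
            old.count -= 1
            if old.count == 0:
                self.zero_count.add(old)   # defer: the stack may still refer to it

    def reconcile(self):
        """Periodic pause: free zero-count objects not referenced from the stack."""
        on_stack = {id(o) for o in self.stack}
        pending = [o for o in self.zero_count if id(o) not in on_stack]
        while pending:
            obj = pending.pop()
            self.zero_count.discard(obj)
            obj.freed = True
            for i, ref in enumerate(obj.slots):
                obj.slots[i] = None
                if ref is None:
                    continue
                ref.count -= 1
                if ref.count == 0:
                    if id(ref) in on_stack:
                        self.zero_count.add(ref)
                    else:
                        pending.append(ref)
```

Between reconciliations, a temporary that is only ever held in a local variable costs no count adjustments at all, which is where the savings come from.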













What would be the space cost for deferred reference counting on SOAR? The most
efficient representation of a reference count on SOAR would be one word per count. Table
5.4 shows the code sequence for reference counting on SOAR. Since this sequence is nine
words long, we can multiply the number of tagged stores by nine to compute the code
overhead for reference counting on SOAR (Table 5.5). This calculation shows that a
straightforward implementation of deferred reference counting would increase the image size by 16%.*

Although more efficient than immediate reference counting, deferred reference counting
still does not reclaim circular structures. This is its biggest drawback.

5.7.1. Mark and Sweep

The first marking storage reclamation algorithm, mark and sweep, was introduced in
1960 [McC60]. It has many variations [Coh81, Knu73, Sta80], and is used in contemporary
systems [FoF81]. After marking reachable objects, the mark and sweep algorithms reclaim
one object at a time, by sweeping the entire address space. Fateman has found that some
Franz Lisp programs spend 25% to 40% of their time marking and sweeping [Fat83] and
require about 1.9 MB for dynamic objects (compared to about 1 MB for static objects).
These algorithms are inefficient because they access a large number of objects; the marking
phase inspects all live objects, and the sweeping phase modifies all dead ones.

The marking phase inspects every live object and thereby causes backing store
operations.* Foderaro found that for some LISP programs, hints to the virtual memory system
could reduce the number of page faults for a mark and sweep from 120 to 90 [FoF81]. Even
with hints, marking and sweeping with paging causes on average a 4.5 second pause every
79 seconds. This is unacceptable for an interactive personal computer.

5.7.2. Scavenging Live Objects

The costly phase of sweeping dead objects can be eliminated by moving the live 
objects to a new area, a technique called scavenging. A scavenge is a breadth-first traversal 
of reachable objects. After a scavenge, the former area is free, so that new objects can be 
allocated from its base. In addition to the performance savings, a scavenging reclaimer also 
compacts, obviating a separate compaction pass. Scavenging algorithms must also update 
pointers to the relocated objects. 
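
As an illustration of the paragraph above, here is a minimal sketch of a breadth-first scavenge with forwarding pointers. It is a simplified stand-in rather than any of the cited algorithms: objects are fixed-size, each has two pointer fields, and the names from_space, to_space, allocate and scavenge are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SPACE_OBJECTS 32
#define NFIELDS        2

/* Hypothetical fixed-size object: a forwarding slot plus pointer fields. */
struct object {
    struct object *forward;          /* non-NULL once the object has moved */
    struct object *field[NFIELDS];   /* NULL means "no referent" */
    int            payload;
};

struct space {
    struct object slot[SPACE_OBJECTS];
    int           used;              /* number of slots in use */
};

struct space from_space, to_space;

struct object *allocate(struct space *s, int payload)
{
    assert(s->used < SPACE_OBJECTS);
    struct object *o = &s->slot[s->used++];
    memset(o, 0, sizeof *o);
    o->payload = payload;
    return o;
}

/* Copy an object to to_space unless it has already been moved, and
 * return its new address in either case. */
static struct object *forward(struct object *o)
{
    if (o == NULL)
        return NULL;
    if (o->forward == NULL) {
        struct object *copy = &to_space.slot[to_space.used++];
        *copy = *o;                  /* fields still point into from_space */
        o->forward = copy;           /* leave a forwarding pointer behind */
    }
    return o->forward;
}

/* Move everything reachable from the roots into to_space, updating the
 * roots and all copied fields. Returns the number of survivors. */
int scavenge(struct object **roots, int nroots)
{
    int scan = 0;

    to_space.used = 0;
    for (int i = 0; i < nroots; i++)
        roots[i] = forward(roots[i]);
    while (scan < to_space.used) {   /* the scan chases the allocation end */
        struct object *o = &to_space.slot[scan++];
        for (int f = 0; f < NFIELDS; f++)
            o->field[f] = forward(o->field[f]);
    }
    from_space.used = 0;             /* the former area is free again */
    return to_space.used;
}
```

The survivors end up contiguous at the base of to_space, which is the compaction the text mentions; unreachable objects are never touched, and cyclic structures are handled because a forwarded object is never copied twice.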

Automatic storage reclamation algorithms that scavenge include Baker’s semispace
algorithm [Bak77], Ballard's algorithm [BaS83], Generation Garbage Collection [LiH83],
and Generation Scavenging. Baker’s algorithm divides memory into two spaces and
scavenges all reachable objects from one space to the other (Figure 5.4). Ballard
implemented this algorithm for his VAX/Smalltalk-80 system and observed that many objects
were long-lived. The addition of a separate area for these objects resulted in a substantial
performance improvement by eliminating the periodic copy of them. Ballard’s system has
600 KB for static objects, a 512 KB object table, and two 1 MB semispaces for dynamic
objects. It spends only 7% of its time reclaiming storage, including sweeping the object
table to reclaim entries. Since it is embedded in an interpretive system that runs
Smalltalk-80 programs a twelfth as fast as the Dorado (Table 2.2), the CPU overhead for this
algorithm may rise above 7% on a high-performance system.

* The sweep phase also requires backing store operations, but its sequential nature accommodates prefetching.

Generation Garbage Collection [LiH83] exploits the observation that many young
objects die quickly and generalizes Baker’s algorithm by segregating objects into
generations, each within its own space (Figure 5.5). Each generation may be scavenged without
disturbing older ones, permitting younger generations to be scavenged more often. This
reduces the time spent scavenging older, more stable objects. At present, there are no
published performance data on this algorithm.

The scavenging algorithms above incur hidden costs because they interleave scavenging
with program execution. The key idea is to avoid pauses due to scavenging by subdividing
the work and scavenging a few objects every time a new one is allocated. The problem
with mixing execution with reclamation is that the program may try to use a pointer to an
object that has been scavenged to another area. This problem can be solved by checking all
loads and following the forwarding pointers, but the solution in turn imposes additional
overhead on the running program. Thus, eliminating pauses slows execution.

Figure 5.4: Baker semispaces. The Baker storage reclamation algorithm divides memory
into semispaces. When one fills up, the live objects in it are copied to the other semispace.

Figure 5.5: Generation garbage collection. Generation garbage collection is a generalization
of Baker semispaces. This algorithm divides memory into many small semispaces, one
per “generation.” When a semispace fills up, its contents are scavenged to the next one.

Algorithms that segregate objects into generations must maintain tables of references 
from older to younger objects. These algorithms save time by reclaiming space in younger 
generations without traversing older generations. The burden of maintaining these tables 
falls on some store instructions. 


5.8. The Generation Scavenging Automatic Storage Reclamation Algorithm

Generation Scavenging arose from our attempts to find an efficient, unobtrusive storage
reclamation algorithm for SOAR that did not require microcode. Our test vehicle was
Berkeley Smalltalk, which originally used reference counting. Measurements of BS object
lifetimes proved that young objects die young and old objects continue to live. We then
designed Generation Scavenging to exploit that behavior and substituted it for reference
counting in Berkeley Smalltalk. The result was an eight-fold reduction in the percentage of
time spent reclaiming storage, from 13% to 1.5%. In addition, the intrinsic compaction
provided by scavenging made it possible to eliminate the Object Table and its accompanying
indirection. After eliminating the object table and reference counting, BS ran 1.7 times
faster than before. In addition to the performance improvement, since Generation Scavenging
was not based on reference counting, it was able to reclaim cycles of unreachable data
structures.

5.8.1. Overview of Generation Scavenging Algorithm

Each object is classified as either new or old. Old objects reside in a region of memory
called the old area. All old objects that reference new ones are members of the remembered
set. Objects are added to this set as a side effect of store instructions. (This checking is not
required for stores into local variables because stack frames are always new.) Objects that no
longer refer to new objects are deleted from the remembered set during scavenging. All new
objects that are referenced must be reachable directly from the old objects in the
remembered set, or through a chain of new objects ultimately linked to the remembered set.
Thus, a traversal in new space, starting at the remembered set (and virtual machine
registers), can find all live new objects. Table 5.6 summarizes the characteristics of the two
generations for Generation Scavenging.
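
The store check that maintains the remembered set can be sketched as follows. This is an illustration under assumptions, not the BS or SOAR code: the is_old and is_remembered flags, the fixed field count, and the name checked_store are hypothetical, and a real system would derive an object's generation from its address or tag rather than from a flag.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REMEMBERED 32
#define NFIELDS         4

/* Hypothetical object layout for the example. */
struct object {
    int            is_old;         /* 1 if the object lives in the old area */
    int            is_remembered;  /* 1 if already in the remembered set */
    struct object *field[NFIELDS];
};

struct object *remembered_set[MAX_REMEMBERED];
int            remembered_count;

/* Store a pointer into an object's field. As a side effect, an old object
 * that now refers to a new object joins the remembered set (once). */
void checked_store(struct object *holder, int index, struct object *value)
{
    assert(index >= 0 && index < NFIELDS);
    holder->field[index] = value;

    if (holder->is_old && value != NULL && !value->is_old &&
        !holder->is_remembered) {
        assert(remembered_count < MAX_REMEMBERED);
        holder->is_remembered = 1;
        remembered_set[remembered_count++] = holder;
    }
}
```

Stores into new objects fall through the test without touching the set, which mirrors the remark that stores into local variables need no check.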

There are three areas for new objects (Figure 5.6): 

• NewSpace, a large area where new objects are created, 

• PastSurvivorSpace, which holds new objects that have survived previous scavenges. 


• FutureSurvivorSpace, which is used only during scavenging. 

A scavenge moves live new objects from NewSpace and PastSurvivorSpace to
FutureSurvivorSpace, then interchanges Past and FutureSurvivorSpace. At this point, no live
objects are left in NewSpace, and it can be reused to create more objects. The scavenge incurs a
space cost of only one bit per object. Its time cost is proportional to the number of live new
objects and thus is small since only 1 in 20 objects survive a scavenge. If a new object
survives enough scavenges, it moves to the old object area and is no longer subject to online
automatic reclamation. This promotion to old status is called tenuring. Figure 5.7 depicts
both the old and new areas for Generation Scavenging.


Table 5.6: Generations in Generation Scavenging for BS.

contents             volatile objects    permanent objects

residence            new space           old space
space size           200 KB*             940 KB
location             main memory         demand paged
created by           instantiation       tenuring
reclaimed by         scavenging          mark-and-sweep
reclaimed every      16 sec              3 - 8 hrs
reclamation takes    0.16 sec            5 min


5.8.2. Detailed Description of Generation Scavenging

Recall that the purpose of a scavenge is to transport the surviving new objects from
NewSpace and PastSurvivorSpace to FutureSurvivorSpace. A one-pass breadth-first algorithm
copies the objects and updates pointers to them as it goes along. It starts by searching
all the old objects in the remembered set for pointers to new objects, which it copies to
FutureSurvivorSpace. Then, it updates the pointer to point to the copy instead of the
original, leaves another pointer to the copy in the first word of the original, and sets a flag
bit to indicate that the original has been moved. If the scavenging algorithm encounters a
reference to the same object again, the flag bit and forwarding pointer will enable it to detect
that the object has already been scavenged and to update the reference. After this first pass,
all new objects referenced by old objects have been scavenged. Now, the algorithm starts
traversing FutureSurvivorSpace and scavenging any new objects referenced from there. As
more objects are copied, the end of FutureSurvivorSpace grows away from the scan, until
finally, all live new objects have been scavenged and the scan catches up to the end. At this
point, the algorithm terminates.


Figure 5.6: Generation Scavenging's three areas for new objects. The largest area holds
newly-created objects (NewSpace). Two smaller areas alternately hold objects that have
survived previous scavenges (PastSurvivorSpace) and receive objects copied by the current
scavenge (FutureSurvivorSpace). This unbalanced division saves memory over a
semispace algorithm.

Figure 5.7: Bird's eye view of Generation Scavenging. After an object has survived
enough scavenges, it is promoted to the old object area. New objects are locked down in
physical memory; old objects reside in virtual memory and may be paged out.

In addition to preserving live objects, the algorithm must promote objects that survive
for a long time into OldSpace. If they were not promoted, much time would be wasted copying
and recopying the same objects back and forth. So, each object includes a count of the number
of scavenges it has survived. If this count should reach a certain threshold, the object gets
scavenged to OldSpace instead of FutureSurvivorSpace. At this point, the object must be
added to the end of the remembered set in case it contains any pointers to other new
objects. After completing a pass, the algorithm checks the remembered set. If it has grown,
the new part is scanned, which may add objects to the end of FutureSurvivorSpace. Then, if
FutureSurvivorSpace has grown, the new portion of that area must be scanned, which may
add objects to the end of the remembered set. The final form of the algorithm, therefore,
resembles two coroutines: one which searches the remembered set, and another which
searches FutureSurvivorSpace for pointers to new objects. This is easily implemented in C
with two subroutines called alternately in a loop. The loop terminates when one of the
subroutines completes without adding more objects for the other one to scan.


We now present the Generation Scavenging algorithm top-down, in pidgin C:

    struct space {
        word_t *firstWord;  /* start of space */
        int size;           /* number of used words in space */
    };

    struct object {
        int size,
            age;
        boolean isForwarded,
                isRemembered;
        union {
            struct object *contents[],
                          *forwardingPointer;
        };
    };

    struct space NewSpace, PastSurvivorSpace, FutureSurvivorSpace, OldSpace;

    struct object *RememberedSetContents[MaxRemembered];
    int RememberedSetSize;





















    /*
     * The main routine, generationScavenge, first scavenges the new
     * objects immediately reachable from old ones. Then it scavenges
     * those that are transitively reachable. If this results in
     * a promotion, the promotee gets remembered, and it first
     * scavenges objects adjacent to the promotee, then scavenges the
     * ones reachable from the promotee. This loop continues until
     * no more reachable objects are left. At that point,
     * PastSurvivorSpace is exchanged with FutureSurvivorSpace.
     *
     * Notice that each pointer in a live object is inspected once and
     * only once. The previousRememberedSetSize and
     * previousFutureSurvivorSpaceSize variables ensure that no object
     * is scanned twice, as well as detecting closure. If this were
     * not true, some pointers might get forwarded twice.
     */
    generationScavenge()
    {
        int previousRememberedSetSize = 0;
        int previousFutureSurvivorSpaceSize = 0;

        while (TRUE) {
            scavengeRememberedSetStartingAt(previousRememberedSetSize);
            if (previousFutureSurvivorSpaceSize == FutureSurvivorSpace.size)
                break;

            previousRememberedSetSize = RememberedSetSize;
            scavengeFutureSurvSpaceStartingAt(
                previousFutureSurvivorSpaceSize);
            if (previousRememberedSetSize == RememberedSetSize)
                break;

            previousFutureSurvivorSpaceSize = FutureSurvivorSpace.size;
        }

        exchange(PastSurvivorSpace, FutureSurvivorSpace);
    }



































    /*
     * scavengeRememberedSetStartingAt(n) traverses objects in the remembered
     * set starting at the nth one. If the object does not refer to any new
     * objects, it is removed from the set. Otherwise, its new referents
     * are scavenged.
     */
    scavengeRememberedSetStartingAt(dest)
        int dest;
    {
        int source;

        for (source = dest; source < RememberedSetSize; ++source)
            if (scavengeReferentsOf(RememberedSetContents[source])) {
                RememberedSetContents[dest++] =
                    RememberedSetContents[source];
            }
            else
                resetRememberedFlag(RememberedSetContents[source]);
        RememberedSetSize = dest;
    }


    /*
     * scavengeFutureSurvSpaceStartingAt(n) does a breadth-first
     * traversal of the new objects starting at the one at the nth word
     * of FutureSurvivorSpace.
     */
    scavengeFutureSurvSpaceStartingAt(n)
        int n;
    {
        struct object *currentObject;

        while (n < FutureSurvivorSpace.size) {
            scavengeReferentsOf(
                currentObject =
                    (struct object *) &FutureSurvivorSpace.firstWord[n]);
            n += sizeOfObject(currentObject);
        }
    }






























    /*
     * scavengeReferentsOf(anObject) inspects all the pointers in anObject.
     * If any are new objects, it has them moved to FutureSurvivorSpace,
     * and returns truth. If there are no new referents, it returns falsity.
     * For simplicity here, an object is just an array of pointers.
     */
    scavengeReferentsOf(anObject)
        struct object *anObject;
    {
        int i;
        boolean foundNewReferent;
        struct object *referent;

        foundNewReferent = FALSE;
        for (i = 0; i < anObject->size; i++) {
            referent = anObject->contents[i];
            if (isNew(referent)) {
                foundNewReferent = TRUE;
                if (!isForwarded(referent))
                    copyAndForwardObject(referent);
                anObject->contents[i] = referent->forwardingPointer;
            }
        }
        return (foundNewReferent);
    }

    /*
     * copyAndForwardObject(obj) copies a new object either to
     * FutureSurvivorSpace, or if it is to be promoted, to OldSpace.
     * It leaves a forwarding pointer behind.
     */
    copyAndForwardObject(oldLocation)
        struct object *oldLocation;
    {
        struct object *newLocation;

        if (oldLocation->age < MaxAge) {
            ++oldLocation->age;
            newLocation = copyObjectToSpace(oldLocation,
                                            FutureSurvivorSpace);
        }
        else
            newLocation = copyObjectToSpace(oldLocation, OldSpace);

        oldLocation->forwardingPointer = newLocation;
        oldLocation->isForwarded = TRUE;
    }







































How do old objects get reclaimed? An offline reclamation program traverses and
copies all objects in depth-first order to a file. This is a three-pass algorithm: The first pass
copies the live objects to a file and leaves forwarding pointers in the original objects. The
second pass traverses the file and updates the pointers. The third pass reads the file into
memory, overwriting the original area. Copying rearranges the objects into depth-first order,
which helps to reduce the number of page faults [Bla83b, Bla83d, Sta82, Sta84]. The whole
process takes a few minutes. If it is only required once or twice a day, it should not be too
disruptive.

5.8.3. Comparing Generation Scavenging to Other Scavenging Algorithms

Generation Scavenging most resembles Ballard's scheme [BaS83]:

• It segregates objects into young and old generations. 

• It copies live objects instead of sweeping dead objects. 

• It reclaims old objects offline.

Generation Scavenging differs from Ballard's Semispaces and Lieberman-Hewitt’s
Generation Garbage Collection [LiH83]. Unlike those algorithms, Generation Scavenging

• conserves main memory by dividing new space into three spaces instead of two. 

• is not incremental. Instead, the small pauses introduced by Generation Scavenging are
unnoticeable in normal interactive sessions. (They are noticeable in real-time
applications such as animation.) Incremental algorithms require checking on every load
instruction, and Generation Scavenging saves this time by not being incremental.

5.9. Performance Evaluation of Generation Scavenging

How well does Generation Scavenging perform in Berkeley Smalltalk and SOAR? We 
concentrate on four metrics: 

• CPU time overhead, the CPU time spent reclaiming storage divided by the total CPU
time in the session,

• pause time, the time that the user must wait for reclamation,

• peak main memory usage, the amount of main memory that must be dedicated for
temporary objects, and

• backing store accesses, the number of times that the reclamation algorithm requires 
data not present in main memory. 

5.9.1. Evaluating Generation Scavenging in Berkeley Smalltalk 

The Smalltalk-80 macro benchmarks [McC83] consist of representative activities like 
compiling and text editing. We measured the performance of Generation Scavenging in BS 
while running these benchmarks. Although our workstation had 2 MB of main memory, 
only about half of that was available to Berkeley Smalltalk. Table 5.7 shows the results. 

CPU Time Cost. Our measurements of BS show that Generation Scavenging requires
only 1.5% of the total (user CPU) time. This is four times better than its nearest competitor,
Ballard's modified semispaces, which takes about 7%.

One reason that Generation Scavenging looks so good is that BS executes programs 
more slowly than some other Smalltalk-80 systems. However, the next section shows that 
Generation Scavenging performs well on fast Smalltalk-80 systems. 

Main Memory Consumption. Although each of the three new object areas occupies
140 KB of virtual memory (420 KB total), only 28 KB of each survivor area gets used. The
rest serves as a reserve against pathological survival and need not be resident. Thus, the
total primary memory cost for dynamic objects is 200 KB, about 10% of the BS main
memory. If we used Baker semispaces with the same scavenging rate, each space would
need to be 140KB + 28KB, for a total of 360 KB, almost twice as much as Generation
Scavenging.

Table 5.7: Performance of Generation Scavenging in BS

total instructions executed       4500 k
amount of storage reclaimed       3900 KB
amount of tenured storage         9.1 KB
number of checked stores          190 k
number of remembered objects      320
number of scavenges               32
mean length of survivors          4.8 Kword
total user CPU time               280 secs.
total Real time                   500 secs.
real time scavenging              1.8%
user time scavenging              1.5%
time checking stores              0.1%
max old space used                940 KB
max new space                     140 KB
max survivor space                28 KB

total size                        1800 KB
resident set size                 930 KB

min pause time*                   90 ms
median pause time*                150 ms
mean pause time*                  160 ms
90th %ile pause time*             220 ms
max pause time*                   330 ms
mean time between scavenges       16 seconds

Backing Store Operations. Since new objects are always created in the same area,
they can remain in main memory. Unfortunately, Unix on the Sun 68010 workstation (Sun
Release 2.0) does not implement the system call that would lock down this area. Thus, the
first six scavenges caused 283 minor page faults (page reclaims), and the rest of the
scavenges caused four. With a working set of 930 KB, 60 major page faults occurred during
the benchmarks.

Pauses. Except for the page faulting during the first six scavenges (see above), the
pauses were small and mostly unobtrusive, averaging 150 ms. The longest pause was only
330 ms. About 15% of the pause time was spent in the Unix kernel on unrelated overhead.
Since people have difficulty noticing pauses of 100 ms, this algorithm's performance meets
our requirements.

5.9.2. Evaluating Generation Scavenging on SOAR

The previous section shows that Generation Scavenging performs well in BS, requiring
fewer than 1.5% of the CPU cycles. How well will this algorithm perform on SOAR?
SOAR will run Smalltalk programs ten times faster than BS. This will result in ten times
more garbage created in the same amount of time, but we would not expect Generation
Scavenging to run ten times faster on SOAR than in Berkeley Smalltalk. If it ran at the
same speed, then the overhead for scavenging on SOAR would be ten times worse, or 15%.
In fact, as we show in Section 5.9, Generation Scavenging takes only about 2% of SOAR’s
time.

5.9.2.1. SOAR Scavenge Duration

We have written Generation Scavenging in SOAR assembly language and simulated it 
in the course of running the macro benchmarks. Table 5.8 gives measurements of 12 
scavenges, 9 from the decompiler benchmark, two from the printDefinition benchmark, and 
one from the compiler benchmark. (See Chapter 4.1 for a description of the benchmarks.) 
As expected, the duration of a scavenge can be predicted from the number of words of new 
objects that survive the scavenge. Figure 5.8 superimposes the observed data with a linear 
regression. The regression predicts that the number of cycles for a scavenge is 
2*Ksurviving -words +3500 with a correlation coefficient r of 0.976. 

The last column of Table 5.8 gives the duration, or pause time, of each scavenge,
assuming 400 ns per cycle. Despite identical cycle times, SOAR’s mean scavenge time was

















Table 5.8: Statistics on twelve scavenges simulated for SOAR.
The last column assumes a cycle time of 400 ns.

         name of            scavenge    data         cycles    scavenge
         benchmark          time        scavenged    per       time
                            (cycles)    (words)      word      (ms)

  1      decompiler                                  23        23
  2      decompiler         43,832      2,028        23        19
  3      decompiler         45,491      2,022        22        18
  4      decompiler         41,262      1,828        23        17
  5      decompiler         69,937      3,114        22        27
  6      decompiler         37,449      1,692        22        15
  7      decompiler         37,157      1,693        23        16
  8      decompiler         30,100      1,489        20        12
  9      decompiler         29,228      1,489        20        12
  10     printDefinition    63,417      1M2          25        25
  11     printDefinition    53,535      2M1          21        22
  12     compiler           60,374      2,834        21        24

         min                            1,500        20        12
         25%ile             37,000      1,700        21        15
         median             45,000      2,000        22        18
         mean               48,000      2,200        22        19
         (s.d.)             (13,000)    (540)        (1.4)     (5.0)
         75%ile             57,000      2^00         23        23
         max                70,000      3,100        25        27





































































  scavenge to copy them. On the other hand, SOAR allocates activation records in a
  separate stack that gets scanned rather than copied. The numbers show that the average
  BS scavenge copied 4.8 Kwords whereas the average SOAR scavenge copied only
  2.1 Kwords. This accounts for 2.3 times the work.

The above two explanations together account for a factor of 4.6, leaving a factor of 1.8
performance improvement to be explained by the next two differences (which are harder to
quantify):

• Assembly code can be more efficient than C. Generation Scavenging is written in 
assembler for SOAR and in C for BS. 

• SOAR’s architecture runs programs faster than the 68010's. In particular, the reduced
instruction set, register file, word addressing, fast shuffle, and tag checking hardware
might contribute to the performance improvement of scavenging in SOAR.

5.9.2.2. SOAR Scavenge Frequency

The worst SOAR scavenge took 27 ms, which is well below the threshold for an
annoying pause. However, if the time that a program could run between one scavenge and
the next were too short, the 27 ms pause would still be unacceptable. The length of this gap
between pauses is determined by the creation rate for new objects and by the amount of
memory available to hold them. To measure this interval, we ran six benchmarks on SOAR
and measured the rate of object creation during a (randomly chosen) portion of each. The
data are presented in Table 5.9. With 150 KB available for newly-created objects, 2.3
seconds of computation will be available to amortize the 27 ms scavenging pause. The
creation rate would have to grow by an order of magnitude to be a problem.
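
The interval between scavenges follows from the allocation rate and the size of the new area. The sketch below works through that arithmetic; it assumes 4-byte words and 1 KB = 1024 bytes, neither of which is stated in this section, and the function names are made up for the example.

```c
#include <assert.h>

#define CYCLE_NS        400.0              /* cycle time used in the text */
#define NEW_AREA_BYTES  (150.0 * 1024.0)   /* 150 KB new area */
#define BYTES_PER_WORD  4.0                /* assumption: 32-bit words */

/* Allocation rate in words per second, from a benchmark's cycle count
 * and the number of words it allocated. */
double words_per_second(double words, double cycles)
{
    assert(cycles > 0.0);
    double seconds = cycles * CYCLE_NS * 1e-9;
    return words / seconds;
}

/* Seconds of computation before the new area fills up. */
double scavenge_interval_seconds(double words, double cycles)
{
    double rate = words_per_second(words, cycles);
    assert(rate > 0.0);
    return (NEW_AREA_BYTES / BYTES_PER_WORD) / rate;
}
```

Under these assumptions, the compiler benchmark (7,467 words in 1,117,660 cycles) allocates about 17 Kwords per second and fills the new area in roughly 2.3 seconds, while the decompiler benchmark (36,886 words in 2,958,219 cycles) fills it in roughly 1.2 seconds.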







Table 5.9: Space allocation rate benchmarks on SOAR.
(Samples are complete second iterations of each benchmark.)
(Assumes new area size = 150KB, cycle time = 400 ns.)

benchmark          duration     space        growth    growth      scavenge
                                allocated    rate      rate        interval
                   (cycles)     (words)      (w/kc)    (kw/sec)    (secs)

decompiler         2,958,219    36,886       12        31          1.2
printHierarchy     119,040      1,426        12        30          1.3
allImplementors    2,257,051    18,058       8.0       20          1.9
printDefinition    75,319       509          6.8       17          2.3
compiler           1,117,660    7,467        6.7       17          2.3
classOrganizer     2,959,728    9,905        3.3       8.4         4.6

mean               -            -            8.1       21          2.3
s.d.               -            -            3.4       8.6         1.2


5.9.2.3. Net SOAR Scavenge Overhead

Given the above data, we can calculate the pause time, gap between scavenges, and
average scavenge overhead (Table 5.10). The results show that Generation Scavenging is
non-disruptive; a 27 ms pause every second is hard to notice. Furthermore, scavenging uses
less than 2% of the CPU time, allowing the computation to proceed at full speed.
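
The overhead figures combine a pause time with a scavenge interval. The following sketch assumes the overhead is simply the pause divided by the interval; the text does not spell out the formula, and the function name is hypothetical.

```c
#include <assert.h>

/* Scavenge overhead as a percentage: pause time over the interval of
 * computation between scavenges (assumed formula). */
double scavenge_overhead_percent(double pause_ms, double interval_s)
{
    assert(interval_s > 0.0);
    return 100.0 * (pause_ms / 1000.0) / interval_s;
}
```

With the mean values reported above (19 ms pauses, 2.3 seconds between scavenges) this gives about 0.8%; pairing the longest pause (27 ms) with the shortest interval in Table 5.9 (1.2 seconds) gives a little over 2%.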

5.9.2.4. Generation Scavenge Trap Time

Recall that the Generation Scavenging algorithm maintains a table of references from
old to new objects. SOAR traps when it creates such a reference, enabling the trap routine to
enter the address of the referenced object in the table. Table 5.11 gives an analysis of store
trap overhead for the simulated macro benchmarks. The path length of 100 cycles for a store
trap was determined by assuming a 1 in 8 chance of window overflow, and taking the worst
case for the other branches. The worst case overhead to maintain the remembered set is 1%,
with a median of 0.05%.

Table 5.10

                       best case    average     worst case

pause time             12 ms        19 ms       27 ms
scavenge interval      4.6 secs     2.3 secs    1.2 secs
scavenge overhead      0.3%         0.8%        2.3%
trapping overhead      0%           0.05%       1.0%
total overhead         0.3%         0.9%        3.3%
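
The store trap overhead is the number of traps times the assumed 100-cycle path length, divided by the benchmark's total cycles. A small sketch of that calculation follows; the function name is hypothetical.

```c
#include <assert.h>

#define STORE_TRAP_CYCLES 100.0   /* path length assumed in the text */

/* Percentage of a benchmark's cycles spent in store traps. */
double store_trap_overhead_percent(long traps, long benchmark_cycles)
{
    assert(benchmark_cycles > 0);
    return 100.0 * ((double)traps * STORE_TRAP_CYCLES)
                 / (double)benchmark_cycles;
}
```

For example, 12 traps in the 119,040 cycles of printHierarchy cost about 1.0%, and 14 traps in the 2,959,728 cycles of classOrganizer cost about 0.05%.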

5.9.3. Summary of Generation Scavenging’s Performance

Table 5.12 summarizes our findings. See Appendix D for a more detailed description.
Generation Scavenging offers outstanding performance:

• At 3%, its CPU overhead is three times lower than deferred reference counting, its
nearest competitor on a compiled Smalltalk-80 system. The overhead is so low that
designers of high-performance systems who formerly shunned automatic storage
reclamation can now embrace it.


Table 5.11: Generation Scavenge store trapping overhead in SOAR.

Benchmark          Benchmark    store    store trap    store trap
Name               Cycles       traps    cycles        overhead

decompiler         2,958,219    0        0             0%
allImplementors    2,257,051    1        100           0.004%
classOrganizer     2,959,728    14       1,400         0.05%
compiler           1,117,660    7        700           0.06%
printDefinition    75,319       1        100           0.13%
printHierarchy     119,040      12       1,200         1.0%

median                                                 0.05%


Table 5.12: Summary of Generation Scavenging’s Performance.

                                  Berkeley Smalltalk    SOAR

execution model                   interpreted           compiled
source of data                    measurements          simulations
processor                         MC68010               SOAR
cycle time                        400 ns                400 ns
CPU time overhead                 1.5%                  0.9%
  worst case                      n.a.                  3.3%
pause time (scavenge duration)    160 ms                19 ms
  worst case                      330 ms                28 ms
peak main memory usage            200 KB                200 KB
backing store accesses            0.15                  n.a.















• The short pause times for Generation Scavenging are a good match to an exploratory
programming environment. Since people have difficulty noticing pauses of 100 ms,
they will not be disturbed by pauses of 28 ms.

• The 200 KB of main memory needed for temporary data exceeds the space
requirements of most older algorithms. However, given the state of the art in computer
memory hardware, 200 KB of overhead seems reasonable for a system with 2 MB of
main memory.

• Ideally, automatic storage reclamation should not cause any page faults. Even without 
any provisions for locking new and remembered objects in main memory, BS averaged 
only 1 page fault per seven scavenges. 

5.9.4. Performance Evaluation of Direct Addressing on SOAR

Because Generation Scavenging includes compaction, the usual indirection through an
object table is unnecessary in BS and SOAR, making them the only Smalltalk-80 systems
without object tables. The indirection through such a table is sometimes overlooked when
evaluating reference-counting reclamation, but it can be a bottleneck; a typical Smalltalk-80
system accesses the object table 1.2 times per bytecode [UnP83]. Assuming SOAR
performs as fast as the Dorado (300K bytecodes/s), SOAR would access the object table
360,000 times per second. The absolute minimum table access would be a single load
instruction. Assuming 400 ns per cycle, such an indirection would take two cycles, or 800 ns.
At 360,000 table accesses per second, that would be 0.29 seconds of indirection time for each
second of processing time. Discussions with Deutsch suggest that further optimization
possibly could halve this overhead. In other words, an object table would slow SOAR by 15%
to 29%.
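
The estimate above is a product of four factors, which the following sketch spells out. The function name and parameter list are hypothetical; the figure of 1.2 accesses per bytecode is the rate implied by 360,000 accesses per second at 300,000 bytecodes per second.

```c
#include <assert.h>

/* Seconds of object table indirection per second of processing. */
double object_table_overhead(double bytecodes_per_sec,
                             double accesses_per_bytecode,
                             double cycles_per_access,
                             double cycle_ns)
{
    assert(bytecodes_per_sec >= 0.0 && cycle_ns > 0.0);
    double accesses_per_sec = bytecodes_per_sec * accesses_per_bytecode;
    return accesses_per_sec * cycles_per_access * cycle_ns * 1e-9;
}
```

With two cycles per access at 400 ns per cycle the result is about 0.29; halving the cost to one cycle per access gives about 0.14, which brackets the range quoted in the text.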

Although we eliminated the object table to improve performance, there is one
Smalltalk-80 primitive operation that runs much slower without it. The become: primitive








































[Figure 5.10 panels: original; create new array; switch internal pointer]

Figure 5.10: Growing without become. The sequence above illustrates how our modified
sets grow without resorting to become:. The contents are stored in a separate array. To
grow, the set allocates a larger array, initializes it, and redirects an internal pointer to the
new array. We have replaced costly implicit indirection with explicit indirection that incurs
cost only when needed. This is in keeping with the RISC philosophy.
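
The technique in Figure 5.10 can be sketched in C as a set whose elements live in a separately allocated array. This is only an analogy to the Smalltalk-80 classes: the struct int_set type, its integer elements, the doubling growth policy, and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* The set object itself never moves; only its contents array is replaced. */
struct int_set {
    int    *contents;   /* separate array holding the elements */
    size_t  capacity;
    size_t  count;
};

int set_init(struct int_set *s, size_t capacity)
{
    assert(capacity > 0);
    s->contents = malloc(capacity * sizeof *s->contents);
    s->capacity = s->contents ? capacity : 0;
    s->count = 0;
    return s->contents != NULL;
}

/* Grow: allocate a larger array, initialize it from the old one, and
 * redirect the internal pointer. No other reference to the set changes. */
static int set_grow(struct int_set *s)
{
    size_t new_capacity = 2 * s->capacity;
    int *bigger = malloc(new_capacity * sizeof *bigger);
    if (bigger == NULL)
        return 0;
    for (size_t i = 0; i < s->count; i++)
        bigger[i] = s->contents[i];
    free(s->contents);
    s->contents = bigger;          /* switch internal pointer */
    s->capacity = new_capacity;
    return 1;
}

/* Add a value if it is not already present. Returns 0 only on
 * allocation failure. */
int set_add(struct int_set *s, int value)
{
    for (size_t i = 0; i < s->count; i++)
        if (s->contents[i] == value)
            return 1;
    if (s->count == s->capacity && !set_grow(s))
        return 0;
    s->contents[s->count++] = value;
    return 1;
}
```

Every holder of a pointer to the set keeps a valid reference across growth, so nothing equivalent to become: is needed; the price is one explicit indirection on each access to the contents.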


Table 5.13: Performance impact of eliminating becomes.

benchmark          # becomes    duration       duration        cycles
                                w/ becomes     w/o becomes     saved

printDefinition                 75,475         75,317
                                1,383,201      1,127,658
                                4,045,641      3,006,974
                                165,997        119,574


any becomes. But, our efforts to eliminate becomes from programs that did use them were 
handsomely repaid with an 18% to 28% performance improvement. 


Although we have eliminated becomes invoked by the system classes, the SOAR programmer
must either shy away from this primitive, or be prepared to pay a stiff performance
penalty. Forcing the user to worry about the efficiency of a primitive operation runs counter to
the philosophy of exploratory programming environments in general and Smalltalk-80 in
particular. However, we believe that the become primitive is so intrinsically
expensive (fast becomes require a level of indirection that slows down many frequent
operations) that the effort to accomplish a become should not be hidden.

We have also estimated the impact of indirection on code size. An Object Table would
require an extra instruction to load or store a literal variable, and one indirection in the
method prologue (for the receiver). (We are assuming that many indirections will be optimized
away, as in Deutsch and Schiffman’s system.) Table 5.14 presents our analysis under
these assumptions. The extra code for an object table would add only 2% to the size of the
system.

5.9.5. Architectural support for Storage Management 

The SOAR chip supports demand-paged virtual memory with restartable, fixed-size
instructions and a page fault interrupt [SKF85]. An off-chip page map translates addresses
and maintains referenced information. The silicon cost for virtual memory is about 20 support
chips including the page map. Figure 5.11 shows that the SOAR host board hides the
page map access time in memory access time [B1D83].

To support Generation Scavenging, all pointers include a four-bit tag. When a store 
instruction stores a new pointer into an old object, a special trap occurs. The software trap 
handler then records the reference. The tag-checking PLA has 8 inputs and one output, and 
occupies about 0.1% of the total chip area. The cost of the extra control logic to handle the 
trap is harder to measure. As mentioned in Chapter 4, tagged store instructions occur so 
rarely that even this small cost cannot be justified. 
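
The store check described above can be sketched in software. This is an illustrative model, not SOAR's trap handler: the names `Obj`, `tagged_store`, and `remembered` are hypothetical, and a Python attribute stands in for the four-bit generation tag.

```python
OLD, NEW = "old", "new"   # stand-ins for the generation encoded in the pointer tag

class Obj:
    def __init__(self, generation, nfields):
        self.generation = generation
        self.fields = [None] * nfields

remembered = set()   # old objects known to hold pointers to new objects

def tagged_store(target, index, value):
    """Store value into target.fields[index], recording old-to-new references."""
    if isinstance(value, Obj) and target.generation == OLD and value.generation == NEW:
        remembered.add(target)        # this branch plays the role of the software trap handler
    target.fields[index] = value

old_obj, new_obj = Obj(OLD, 2), Obj(NEW, 2)
tagged_store(old_obj, 0, new_obj)     # new pointer into old object: recorded
tagged_store(new_obj, 0, old_obj)     # old pointer into new object: no record needed
```

Only the first kind of store needs bookkeeping, which is why the check can be cheap when such stores are rare.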


Table 5.14: Static cost of object indirection.

method prologues                        4654
literal variable loads                  3532
literal variable stores                  254
total image size                    1,500 kB
relative cost of additional code       2.25%
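
The relative cost in Table 5.14 can be reproduced under two assumptions that the table does not state: each extra indirection costs one 4-byte instruction, and 1,500 kB means 1,500,000 bytes.

```python
# Counts taken from Table 5.14; one extra instruction per site is assumed.
SITES = {
    "method prologues": 4654,
    "literal variable loads": 3532,
    "literal variable stores": 254,
}
INSTRUCTION_BYTES = 4            # assumption: 32-bit fixed-size instructions
IMAGE_BYTES = 1_500_000          # assumption: 1,500 kB read as 1,500,000 bytes

extra_bytes = sum(SITES.values()) * INSTRUCTION_BYTES
relative_cost = extra_bytes / IMAGE_BYTES    # about 0.0225, i.e. 2.25%
```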






[Figure 5.11 diagram: the virtual address is split into a virtual page # and an offset into the page; the page offset goes to the RAM while the page map is accessed; the page map then supplies the physical page #, which goes to the RAM.]


Figure 5.11: Fast address translation. The SOAR system has adopted the same technique
as the Sun 68010 workstation to perform address translation without hurting performance.
It hides the translation time in the address multiplexing delay for the dynamic RAM chips.
On each memory access, the low order address bits that specify the offset into the page are
sent to the memory while simultaneously reading the page map. The physical page number
is then sent to the memory as the second piece of the address. A virtual memory with one
segment per object could not run as fast because the offset into a segment is not identical to
the least significant bits of the physical address. Consequently, no portion of the virtual address
can be sent immediately to the RAM chips.


5.9.6. Generation Scavenging and Activation Records 

We have simplified this chapter by deliberately omitting activation records. In this
section, we outline the problems caused by activation records in Smalltalk-80 and our solutions
to them. Activation records present a problem because a Smalltalk-80 program can
manipulate them like any other object. For instance, a subroutine can obtain a pointer to its
activation record and place it in a global variable. After the subroutine returns, another routine
can inspect the activation record via the global variable. Since SOAR activation records
are kept in the register frame stack, extraordinary measures are required to preserve this
information. When a Smalltalk-80 program creates a reference to an activation record we
mark it as non-lifo. When a non-lifo activation is about to be destroyed (i.e. when a return
instruction attempts to free it), we copy the record to the heap and adjust the references to it.





































Thus, the steps are: 

1) Detect the creation of a non-lifo reference to an activation record, then mark the
activation record as non-lifo:

A non-lifo reference can be created by storing a pointer to an activation record or by
returning such a pointer as a result. We have allocated a distinct tag for activation
records (context or 1111). A tagged store instruction will trap when storing such a
pointer. As for returns, the SOAR compiler generates a trap instruction before each
return that checks the tag and traps if needed. The trap handler sets the high-order bit
of the activation record’s return address. This marks the activation record as non-lifo.
Meanwhile, the reference is added to a software table so it can be updated later.

2) Detect a return from a non-lifo activation record, then copy it and update any references
to it.

The return instruction traps if the return address has its high-order bit set. This trap
handler then allocates space in the new area for the activation record, copies it, and
updates references to it. At this point there is no need to trap further stores, so the
reference's tag is changed to new.
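
The two steps above can be modeled in a few lines. This is a simplified sketch: the `Machine` and `Activation` classes, the boolean `non_lifo` flag (standing in for the high-order bit of the return address), and the dictionary-based reference table are all assumptions made for illustration, and only stores (not returned pointers) are modeled.

```python
class Activation:
    def __init__(self, name):
        self.name = name
        self.non_lifo = False   # stands in for the high-order bit of the return address
        self.on_heap = False

class Machine:
    def __init__(self):
        self.stack = []        # register frame stack
        self.heap = []         # activations copied out of the stack
        self.references = []   # software table of (holder, key) slots to update later

    def call(self, name):
        act = Activation(name)
        self.stack.append(act)
        return act

    def store(self, holder, key, value):
        """Step 1: storing a pointer to a stack activation marks it non-lifo."""
        if isinstance(value, Activation) and not value.on_heap:
            value.non_lifo = True
            self.references.append((holder, key))
        holder[key] = value

    def ret(self):
        """Step 2: returning from a non-lifo activation copies it and fixes references."""
        act = self.stack.pop()
        if act.non_lifo:
            copy = Activation(act.name)
            copy.on_heap = True
            self.heap.append(copy)
            remaining = []
            for holder, key in self.references:
                if holder.get(key) is act:
                    holder[key] = copy            # update the recorded reference
                else:
                    remaining.append((holder, key))
            self.references = remaining
        return act

m = Machine()
globals_table = {}
frame = m.call("f")
m.store(globals_table, "ctx", frame)   # creates a non-lifo reference
m.ret()                                # the record is copied to the heap
```

After the return, `globals_table["ctx"]` refers to the heap copy, so later inspection of the activation record still works. Activations that are never stored take the ordinary path and are simply popped.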

We have extended this strategy to include blocks. Smalltalk-80 blocks implement control
structures by allowing one routine to control execution in another's context. Frequently,
a block is created, passed down the call chain to a subroutine that repeatedly
invokes the block and then returns. Thus, we must impose a minimum of overhead on this
case, while handling non-lifo references to blocks. In other words, although a block is an
object that refers to a context, we do not mark the context as non-lifo until the block itself
becomes non-lifo. This is accomplished with the same mechanism outlined above, using the
context tag for block objects.

5.9.7. The Potential Problem of Premature Promotion

Recall that Generation Scavenging is based on the assumption that the longer an object
survives the longer it will remain alive. Therefore, when an object attains a ripe old age, it is
promoted from the new generation to the old. At this point, the system assumes that the
object is immortal and ceases attempts to reclaim it. For this reason, we call the promotion
process tenuring. However, in some cases the object may die shortly thereafter and waste
space long after its useful life.

At first glance, one would expect dead tenured objects to waste backing storage, but 
not main memory. They would seem to get paged out to make room for tenured objects that
remain alive. However, because an object is so small relative to the size of a page (14 vs. 
1024 words), a page could easily contain just a few live objects among many dead ones. 
This internal fragmentation could tie up much more main memory than is actually needed 
for the live objects. In this manner dead tenured objects can increase the number of pages in 
the working set. 

How severe is this problem? We plan to reclaim dead tenured objects once a day by an 
offline reclamation program. How many will build up in a day? We won’t know until we 
measure the lifetimes of objects over hours of elapsed time on a high-performance system 
like the Dorado or SOAR. Chapter 6 has a more detailed discussion of this issue and
strategies for coping, should it turn out to be a problem.


5.10. Summary of Reclamation Algorithms 


Table 5.15 summarizes our results: both Deutsch-Bobrow deferred reference counting
and Generation Scavenging perform well enough for an advanced personal computer. The
advantages of Generation Scavenging over deferred reference counting are:


• it reclaims circular structures,


• it includes compaction, and 


• it uses less than a tenth of the total CPU time. 


5.11. Conclusions 


The combination of generation scavenging and paging provides high performance 
automatic storage reclamation, compaction, and virtual memory. This method of storage 
management has proven its worth daily in Berkeley Smalltalk, which has supported the 
SOAR compiler project, architectural studies, and text editing for portions of this chapter. 

The algorithm we have presented may not accommodate objects that live for a medium
amount of time; they may increase the time overhead or cause thrashing. Measurements
must be taken on high-performance Smalltalk-80 systems to understand the behavior of
these objects.


Table 5.15: Summary of reclamation strategies.


                          main memory      paging     pause     pause
                          for dynamic      I/Os       time      interval


no reclamation

immed. ref. count
(compaction)
deferred ref. count
(compaction)

mark and sweep
Ballard

Generation Scavenging
BS
SOAR best case
SOAR average
SOAR worst case


15% - 20%

25% - 40%
7%*

1900 KB
2000 KB

200 KB
170 KB
170 KB
170 KB

0.025 1.1

* Ballard's storage reclamation algorithm may well exceed 7% overhead on a compiled Smalltalk-80 system [BaS83, DeS84].

High performance storage reclamation relies on two principles:

• Young objects die young. Therefore a reclamation algorithm should not waste time on 
old objects. 

• For young objects, fatalities overwhelm survivors. Copying survivors is much cheaper 
than scanning corpses. 

Careful consideration of the virtual memory system is essential. Generation Scavenging 
combines these lessons to meet stringent performance goals: low time overhead (2% in BS, 
3% in SOAR), imperceptibly short pause times (160 ms in BS, 27 ms in SOAR), and a low 
page fault rate (1.2 faults/sec in BS). Meeting these goals costs 200 KB of primary memory,
but the result is worth it: a high-performance computer system with fast automatic storage
reclamation.


Chapter 6 


Scavenging Data with Intermediate Lifetimes 

6.1. Introduction

What happens if the age of an object fails to predict its lifetime? An object that survives
long enough to be promoted but succumbs shortly thereafter will waste storage in old
space. This chapter contains a detailed description of the problem, how we have attacked it in
Berkeley Smalltalk, some proposals for extra generations, and an analytical model that sheds
some light on the effect of various parameters on performance.

6.2. The Tenuring Threshold

When should Generation Scavenging tenure an object? Since we have observed that 
young objects are likely to die and old ones are likely to persist, our algorithm tenures an 
object that lives long enough. The easiest way to measure age is to count the number of 
scavenges an object survives. Thus, each object contains a byte that is initialized to zero and 
is incremented on each scavenge. If an object survives for a certain number of scavenges, it 
gets tenured. The problem is to choose this threshold. If it is too small, that is if Generation 
Scavenging tenures objects too soon, a large fraction of them will die shortly after receiving 
tenure. Tenured garbage wastes space on backing store, and more importantly, may slow the 
system with extra page faults by mixing dead and live objects on the same page. On the 
other hand, if the tenuring threshold is too high, long-lived objects will pile up in the new 
area, increasing the amount of data that must be copied for each scavenge. This will 
increase the pause time and the CPU overhead for storage reclamation. Thus, the tenuring 
threshold must balance the increase in page faults caused by tenured garbage against the 
extra pause time caused by scavenging long-lived objects. 
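
The age-counting rule can be sketched as follows. This is one possible reading of the description above, chosen so that a threshold of zero promotes every survivor and a threshold of one promotes only objects that have already survived an earlier scavenge; the function name and the dictionary representation of objects are assumptions.

```python
def scavenge(live_objects, threshold):
    """Split the live objects of the new area into (still_new, tenured).

    Each object is a dict whose 'age' counts the scavenges it has already
    survived. An object whose age has reached the threshold is tenured;
    every other survivor has its age incremented and stays in the new area.
    """
    still_new, tenured = [], []
    for obj in live_objects:
        if obj["age"] >= threshold:
            tenured.append(obj)
        else:
            obj["age"] += 1
            still_new.append(obj)
    return still_new, tenured

objects = [{"age": 0}, {"age": 1}, {"age": 3}]
still_new, tenured = scavenge(objects, threshold=2)
```

With a low threshold the `tenured` list fills with objects that may soon die; with a high threshold the `still_new` list, which must be copied on every scavenge, grows instead.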















In Berkeley Smalltalk, we have included a feedback-mediated adaptive algorithm to
set the tenuring threshold. The algorithm examines the amount of data that survived the previous
scavenge and adjusts the tenuring threshold accordingly. The current implementation
limits the tenuring threshold to 64, where it remains most of the time. On SOAR, a tenuring
threshold of 64 would mean that an object would have to survive for more than a minute to
be tenured. Since the response time for most requests is much smaller than a minute, setting
the tenuring threshold to 64 would allow Generation Scavenging to reclaim the bulk of the
garbage online.

We have performed an experiment with BS to better understand tenuring. Since the
objects of concern are those that live for relatively long times, a typical interactive session of
several hours duration would be ideal for characterizing tenuring behavior. Berkeley
Smalltalk’s poor overall performance, 10% of a Dorado, prevented us from gathering data
from a typical interactive session. Lacking a Dorado or SOAR chip, we settled for a synthetic
workload: our image merely ran the decompiler benchmark twenty times. The interval
between scavenges was held fairly constant while varying the tenure threshold. A total
of 20 kw was allocated in the new area (plus 20 kw for each survivor area). The feedback-mediated
scavenge algorithm used an average of 18.7 kw before each scavenge. Table 6.1
gives our results.

Figure 6.1 shows the relationship between the tenuring threshold and the number of
bytes of data that were tenured. As expected, the number of objects achieving tenure
decreases as the time required to obtain tenure increases. In addition, there are two knees in
the curve, also just as expected. The first knee, at a tenure threshold of one, merely
proves that most objects die very quickly. The reason is that a threshold of zero means that
every object gets promoted (even though it may be only milliseconds old), but a threshold
of one means that an object that gets promoted must be older than the time between
scavenges. Since the scavenges occurred every 3.5 seconds, this knee shows that many






Table 6.1: Results of BS tenuring experiment.

tenure       # gs's    total     total      avg.     max      CPU time
threshold              time      tenured    surv.    surv.    overhead
(# gs's)               (secs)    (kw)       (kw)     (kw)     (%)*

 0           90        340       56.0       2.3      4.3      0.6%
 1           83        290       17.0       2.9      4.3      0.8%
 2           83        310       16.9       3.0      4.3      0.8%
 3           83        300       16.7       3.2      4.3      0.9%
 4           83        290        3.7       3.4      4.8      0.9%
 5           83        300        3.7       3.4      4.6      0.9%
 6           83        300        3.9       3.5      4.6      0.9%
 7           83        280        3.7       3.5      4.7      1.0%
 8           83        290        3.6       3.6      4.8      1.0%
16           83        290        2.9       3.8      4.9      1.0%
32           83        300        2.4       4.2      6.9      1.1%
64           83        290        2.0       5.1      6.4      1.4%


objects live less than 3.5 seconds. 

The second knee, at 4, indicates that many objects live for more than 3×3.5 seconds but
less than 4×3.5 seconds. This is not surprising: because each iteration of the benchmark took
about 12 seconds, the only objects tenured at a threshold of 4 were those that survived for
more than one iteration. These were the text lines printed on the screen from the benchmarks.
This experiment confirms our understanding of tenuring; any object which outlives
the product of the tenuring threshold and the inter-scavenge time gets tenured.

Although minimizing the amount of tenured data saves (virtual) memory space and
improves paging performance, it forces the scavenge operation to copy more survivors,
which takes more time. The surprise is how small this increase is. In this experiment, the
quantity of tenured data, which is principally garbage, decreased by a factor of 23, while
the time spent on scavenging merely doubled.

Unfortunately, we would need measurements of a fast Smalltalk-80 system to completely
predict the effects of tenuring. Tenuring affects objects that live for minutes or
hours. These objects are used by people, not programs. For example, the objects that

1. Two generations with fast tenuring. This is the present configuration. Deutsch has
estimated that data structures used by a typical window, for example a browser, consume
15 KB of memory. At 20 cycles per word, that means that it would take 30 ms to
scavenge the data for a window. Thus, assuming 150 KB of new space, every
untenured window would add 3% to the scavenging overhead, limiting the number of
untenured windows to about 4. If the rate of window creation is slow enough, a system
that tenures objects so fast that every window gets tenured may be practical. On the
other hand, if many windows are created and immediately destroyed (as in the case of
error message windows) it may be important to retain a few untenured windows.

2. Two generations with slow tenuring. Assume we dedicate a megabyte of physical
memory to new objects. Then the system can run seven seconds between scavenges.
That means that more data can be scavenged without incurring excessive
overhead. In fact, the limit becomes the scavenge's pause time, not the percentage of
overhead. Suppose that we accept a fifth-second pause every seven seconds. That is
long enough to scavenge seven windows. This may be a sufficient number of
untenured windows to avoid tenuring garbage. (Interestingly, seven is roughly the size
of a human short-term memory.)

3. Three generations with fast tenuring. Suppose we add a third generation in the middle.
Some of the space for the third generation can be obtained by reducing the size of the
youngest generation from 100KB to 50KB, which triples the scavenge overhead to a
(still acceptable) 3%. A middle generation of 300KB of physical memory can contain
ten untenured windows (in each semispace). The time for a scavenge of the middle
generation would be about 300 ms. This option can support about the same number of
windows as the two generation, slow tenuring one, but with slightly more space and
significantly less time overhead.





















4. Three generations with slow tenuring. Suppose we add a large third generation, but
use virtual memory instead of physical. Scavenging this middle-aged generation
would then incur page faults and cause a perceptible pause, perhaps one to three
seconds. However, 30 windows could be created before filling (the 1/2 MB semispace
of) a one megabyte generation. Thus, these long scavenges would be infrequent, and
acceptable.

5. Four generations. SOAR's tags support four generations, so we could combine the
above schemes. The youngest generation would be small, locked into memory, and
frequently scavenged. An object surviving two scavenges would be promoted into the
next generation. This would also be in physical memory, but larger. This generation
would hold the newest few windows. Thus, this is important if many windows are
closed immediately. The third generation would be about a megabyte, and located in
virtual memory. Most windows and medium lifetime objects would reside here. They
could be reclaimed without a complete reorganization. Finally, permanent objects like
the square-root routine would reside in the oldest generation, which would be
reclaimed and reorganized offline. Table 6.2 summarizes these proposals. More work
is needed to measure the behavior of these medium lifetime objects and to design
appropriate two- or three-generation parameters and reorganization algorithms.
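
The per-window figures used in these proposals follow from simple arithmetic, sketched below. The sketch assumes 4-byte words and reads 15 KB as 15,000 bytes; neither is stated in the proposals themselves.

```python
WINDOW_BYTES = 15_000        # Deutsch's estimate for a typical window
BYTES_PER_WORD = 4           # assumption: 32-bit words
CYCLES_PER_WORD = 20         # scavenging cost quoted in proposal 1
CYCLE_TIME_S = 400e-9        # SOAR cycle time

def window_scavenge_time():
    """Seconds needed to scavenge the data of one untenured window."""
    words = WINDOW_BYTES / BYTES_PER_WORD
    return words * CYCLES_PER_WORD * CYCLE_TIME_S

def windows_per_pause(pause_s):
    """How many untenured windows fit in a pause of the given length."""
    return pause_s / window_scavenge_time()

per_window = window_scavenge_time()       # about 0.030 s
in_fifth_second = windows_per_pause(0.2)  # between six and seven windows
```

The second value is the source of the "seven windows" estimate in proposal 2, after rounding.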


6.3. Analysis of a Single Scavenged Generation

How much physical memory must be dedicated to new objects? In this section we
present an analysis of a two-generation system where one generation is scavenged (New)
and the other is reclaimed offline (Old). Since the Old objects are reclaimed offline, we will
only analyze the New generation here. Table 6.3 introduces the relevant terms. The first
constraint we face is to keep the scavenge pauses small enough to be unobtrusive. The data
on scavenging duration in the previous section showed that the length of a scavenge can be

Table 6.2: Summary of tenuring proposals.

generation                  assistant    associate    full         emeritus
type of memory              physical                  virtual

Proposal 1. Two generations, fast tenuring.
creation area   (KB)        140                                    4,000
gap time        (sec)       1                                      ?
survivor area   (KB)        17                                     disk
pause time      (ms)        30                                     60
scavenge time   (%)         3%                                     ?
primary memory  (KB)        170                                    2,000

Proposal 2. Two generations, slow tenuring.
creation area   (KB)        420                                    4,000
gap time        (sec)       3                                      ?
survivor area   (KB)        170                                    disk
pause time      (sec)       0.30                                   60
scavenge time   (%)         10%                                    ?
primary memory  (KB)        760                                    2,000

Proposal 3. Three generations, fast tenuring.
creation area   (KB)        140          0                         4,000
gap time        (sec)       1            600                       ?
survivor area   (KB)        17           150                       disk
pause time      (sec)       0.030        0.30                      60
scavenge time   (%)         3%           0.05%                     ?
primary memory  (KB)        170          300                       1 - 3 MB

Proposal 4. Three generations, slow tenuring.
creation area   (KB)        140                       0            3,000
gap time        (sec)       1                         2,000        ?
survivor area   (KB)        17                        500          disk
pause time      (sec)       0.030                     ~10          60
scavenge time   (%)         3%                        0.5%         ?
primary memory  (KB)        170                       500          0.5 - 2.5 MB

Proposal 5. Four generations.
creation area   (KB)        140          0            0            3,000
gap time        (sec)       1            600          20,000?      ?
survivor area   (KB)        17           150          500          disk
pause time      (sec)       0.030        0.30         ~10          60
scavenge time   (%)         3%           0.05%        0.05%?       ?
primary memory  (KB)        170          300          500          0.5 - 2.5 MB


















Table 6.3: Quantities to analyze a single generation.

symbol    description                                             units

constants
ct        SOAR cycle time                                         seconds
se        scavenge effort: avg. cycles per scavenged byte         cycles per byte
abw       allocation bandwidth: rate of new data instantiation    bytes per second

independent variables
surv      size of each survivor area                              bytes
Eden      size of new object creation area                        bytes

dependent variables
mem       total memory used                                       bytes
pause     length of scavenging pause                              seconds
gap       gap between scavenges                                   seconds
ov        fraction of CPU used for scavenging this generation     fraction [0,1]


predicted from the amount of data surviving the scavenge.

$$ pause = (se \times ct) \times surv \qquad (1) $$

Let’s test this with an example. Plugging in typical SOAR parameters $ct = 400$ ns,
$se = 5.5$ cycles/byte, and $surv = 8{,}800$ bytes:

$$ pause = (5.5 \times 400\ \text{ns}) \times 8{,}800 = 19\ \text{ms} \qquad (1E) $$

which matches the simulated pause time of 19 ms.

Reducing the tenuring threshold will limit the quantity of data that survives a scavenge
by promoting the oldest surviving objects. Once in Old space, they need not be scavenged.
But, as discussed in the previous section, too much tenuring can provoke thrashing. Thus,
we recommend choosing an acceptable pause time (perhaps from 10 ms to 100 ms) and
adaptively adjusting the tenure threshold to maintain the corresponding amount of untenured
data.


The next step is to calculate the amount of memory devoted to newly-created objects.
Let’s assume that the rate of object allocation is fairly constant. Then

$$ gap = \frac{Eden}{abw} \qquad (2) $$

For example, in the growth rate experiment in the previous section, we found that the compiler
benchmark generated 17,000 words per second. Thus, $abw = 68{,}000$ bytes/sec, so for
$Eden = 150{,}000$,

$$ gap = \frac{150{,}000}{68{,}000} = 2.2\ \text{sec} $$

In other words, with 150 KB for new objects, SOAR could run for two seconds between successive
scavenges.


Although $ov = \frac{pause}{pause + gap}$, we will use a simpler approximation,

$$ ov = \frac{pause}{gap} \qquad (3) $$

for our analysis. (This is a reasonable approximation because we only care about systems
with low overhead.) Continuing with our example, we can use equation (3) to calculate the
time overhead:

$$ ov = \frac{19\ \text{ms}}{2.2\ \text{sec}} = 0.86\% \qquad (3E) $$
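
Equations (1) through (3) are easy to evaluate directly. The sketch below uses the SOAR parameters quoted above; the function names are illustrative. Because it carries unrounded intermediate values, the overhead comes out slightly higher than the 0.86% obtained from the rounded 19 ms and 2.2 sec.

```python
CT = 400e-9      # SOAR cycle time, seconds
SE = 5.5         # scavenge effort, cycles per byte
ABW = 68_000     # allocation bandwidth, bytes per second

def pause(surv):
    """Equation (1): scavenge pause in seconds for surv bytes of survivors."""
    return SE * CT * surv

def gap(eden):
    """Equation (2): seconds between scavenges for an Eden of eden bytes."""
    return eden / ABW

def overhead(surv, eden):
    """Equation (3): approximate fraction of CPU time spent scavenging."""
    return pause(surv) / gap(eden)

example_pause = pause(8_800)                   # about 0.019 s
example_gap = gap(150_000)                     # about 2.2 s
example_overhead = overhead(8_800, 150_000)    # a little under 0.9%
```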

Since we have expressions for the pause and gap times, we can combine (1), (2), and
(3) to express the overhead in terms of memory allocations:

$$ ov = \frac{surv}{Eden} \times (se \times ct \times abw) \qquad (4) $$

Suppose we need to decide how much memory to allocate for Eden in SOAR:

$$ ov = \frac{8{,}800}{Eden} \times 0.15 $$

$$ Eden \times ov = 1300\ \text{bytes} \qquad (4E) $$

So, for 2% overhead, we would allocate 65 KB to Eden. This would total
$2 \times 8{,}800 + 65{,}000 = 82$ KB of main memory for New objects.

For the general case we can combine

$$ mem = Eden + 2 \times surv \qquad (5) $$

with (4) to calculate the total memory required. Suppose we built the system as described
above, only to discover that it tenures too much garbage. The first step to cut down on
tenuring would be to boost the quantity of untenured survivors. This will increase the pause
time for a scavenge; equation (1) says that $surv = \frac{pause}{2 \times 10^{-6}}$. Thus, 50 KB of survivors will
result in pauses that last 100 ms. The increased pause time will drive up CPU overhead
unless we dedicate more memory to Eden. Suppose we allow CPU overhead to rise to 5% to
economize on memory, then equation (4) gives the size of the Eden area required.

$$ \frac{50{,}000}{Eden} = \frac{0.05}{0.15} = 0.33, \qquad Eden = \frac{50{,}000}{0.33} = 150{,}000 $$

Equation (5) then supplies the total memory for this generation:

$$ mem = 150{,}000 + 2 \times 50{,}000 = 250{,}000 \qquad (5E) $$
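
The sizing procedure of equations (4) and (5) can be written as two small functions. The names are illustrative, and the constants are the SOAR parameters from the worked examples. Since $se \times ct \times abw$ is 0.1496 rather than the rounded 0.15, the results differ slightly from the rounded figures in the text.

```python
CT = 400e-9      # SOAR cycle time, seconds
SE = 5.5         # scavenge effort, cycles per byte
ABW = 68_000     # allocation bandwidth, bytes per second

def eden_for(surv, ov):
    """Equation (4) solved for Eden: bytes of Eden needed to hold overhead at ov."""
    return surv * (SE * CT * ABW) / ov

def total_memory(eden, surv):
    """Equation (5): Eden plus two survivor areas."""
    return eden + 2 * surv

small_eden = eden_for(8_800, 0.02)      # close to the 65 KB quoted for 2% overhead
large_eden = eden_for(50_000, 0.05)     # close to the 150,000 bytes quoted for 5% overhead
memory = total_memory(150_000, 50_000)  # 250,000 bytes, as in (5E)
```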

6.4. Analyzing a Middle Generation 

What if this is still not enough space for medium-lifetime objects? A third generation
can be added in the middle. This results in a system with three generations: a generation for
evanescent objects (Generation 1), a generation for medium-lived objects (Generation 2),
and a generation for permanent objects (Figure 6.2). Assuming that we keep Generation 2 in
primary memory, how are we going to divide memory among the two scavenged generations?
The equations in the previous section specify the behavior of a single scavenged generation,
so we can apply them to each of the two scavenged generations, using subscripts to
indicate the generation. Then, by superposition from (4):

$$ ov = ov_1 + ov_2 = \frac{(se_1 \times ct_1 \times abw_1)\, surv_1}{Eden_1} + \frac{(se_2 \times ct_2 \times abw_2)\, surv_2}{Eden_2} \qquad (6) $$

For example, assume that each window uses 15 KB of data, and that we want to be able to
support ten windows without tenuring. Then $surv_2 = 150$ KB. If we open one window per
minute, $abw_2 = \frac{15\ \text{KB}}{60\ \text{sec}} = 250$ bytes/sec. ($se$ and $ct$ are the same for both generations.) Thus,

$$ ov = ov_1 + ov_2 = \frac{1300}{Eden_1} + \frac{74}{Eden_2} \qquad (6E) $$


















































































Continuing with our example,

$$ \frac{Eden_1}{Eden} = \frac{\sqrt{1300}}{\sqrt{1300} + \sqrt{74}} = 81\% \quad \text{and} \quad \frac{Eden_2}{Eden} = 19\% $$

Given an optimal split, we can plug (8) into (6) to find the minimum amount of overhead
for a given amount of memory:

$$ ov \times Eden = \left[ \sqrt{(se_1 \times ct_1 \times abw_1)\, surv_1} + \sqrt{(se_2 \times ct_2 \times abw_2)\, surv_2} \right]^2 \qquad (9) $$

For our example,

$$ ov \times Eden = \left[ \sqrt{1300} + \sqrt{74} \right]^2 = 2000 \qquad (9E) $$

So, for 2% overhead, 100 KB of Eden would be needed. Adding in the survivor areas, 420
KB of physical memory would be used for scavenging. What about those long pauses for
Generation 2? From (1), $pause_2 = 150{,}000 \times se \times ct = 300$ ms. From (2),
$gap_2 = \frac{0.19 \times 100\ \text{KB}}{250\ \text{bytes/sec}} = 76$ secs. Thus, by adding a middle generation, we have made
it possible to scavenge more untenured data by increasing the gap between long scavenges.
This lets us keep 160 KB of untenured data in 420 KB of main memory at a time cost of
2.0%.


We may decide that minimizing the total CPU overhead is not as important as reducing
the frequency of long pauses. In that case, we can abandon (8) and use (1) and (2). Suppose
we can only tolerate a 300 ms pause once every 3 minutes. Then, using (2),
$Eden_2 = 180 \times 250 = 45$ KB. Assuming we use the same amount of memory as above, that
leaves 55 KB for $Eden_1$. This results in a 0.81 second gap for Generation 1. With these
parameters the total overhead is $\frac{19}{810} + \frac{300}{180{,}000} = 2.5\%$. Of course, this is worse than the
optimal overhead of 2.0%.
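
The two-generation trade-off can be explored numerically. The sketch below takes the constants 1300 and 74 from the example above; the square-root form of the optimal split is the standard minimizer of $a_1/Eden_1 + a_2/Eden_2$ for a fixed total and is used here as an assumption consistent with the quoted 81% and 19% shares. Function names are illustrative.

```python
from math import sqrt

A1 = 1300.0   # (se * ct * abw_1) * surv_1 for Generation 1, from the example
A2 = 74.0     # (se * ct * abw_2) * surv_2 for Generation 2, from the example

def overhead(eden1, eden2):
    """Total scavenging overhead of the two generations, as in equation (6)."""
    return A1 / eden1 + A2 / eden2

def optimal_split(eden_total):
    """Divide eden_total so that the total overhead is minimized."""
    share1 = sqrt(A1) / (sqrt(A1) + sqrt(A2))
    return share1 * eden_total, (1 - share1) * eden_total

def min_overhead(eden_total):
    """Overhead at the optimal split, as in equation (9)."""
    return (sqrt(A1) + sqrt(A2)) ** 2 / eden_total

eden1, eden2 = optimal_split(100_000)          # roughly 81 KB and 19 KB
best = min_overhead(100_000)                   # about 2.0%
pause_limited = overhead(55_000, 45_000)       # about 2.5%, the long-gap configuration
```

Comparing `best` with `pause_limited` shows the price paid for stretching the gap between long Generation 2 scavenges to three minutes.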

6.5. Controlling the Tenuring Threshold

Objects must be tenured to avoid excessive pauses caused by scavenging too much
data. The problem is to set the tenure threshold given the survivors from the past generation.
We propose that a scavenge also maintain a table giving the total amount of surviving data
for each age. Such a table could then be used to predict the amount of data that would be
promoted for any given tenure threshold. Building this table would add about 10% to the
scavenge time.
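
One way such a table might be used is sketched below. The table layout (a mapping from age to surviving bytes), the byte budget, and the policy of picking the largest threshold that fits the budget are assumptions for illustration; the text proposes only the table itself.

```python
def promoted_bytes(age_table, threshold):
    """Bytes that would be tenured if objects of age >= threshold were promoted."""
    return sum(size for age, size in age_table.items() if age >= threshold)

def choose_threshold(age_table, max_untenured_bytes, max_threshold=64):
    """Largest threshold whose retained (untenured) survivors fit the byte budget."""
    best = 0
    for threshold in range(max_threshold + 1):
        retained = sum(size for age, size in age_table.items() if age < threshold)
        if retained > max_untenured_bytes:
            break
        best = threshold
    return best

# Hypothetical survivor table built during one scavenge: age -> surviving bytes.
age_table = {0: 4000, 1: 3000, 2: 2000, 5: 1000}
threshold = choose_threshold(age_table, max_untenured_bytes=8000)
```

The byte budget would come from the acceptable pause time through equation (1), so the threshold adapts to keep pauses bounded.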




6.6. The Cost of an Offline Reorganization

To better understand the time required by an offline reorganization, we measured one
on BS, on a diskless Sun 68010 workstation. Table 6.4 gives the results: this reorganization
software is slow; 1200 memory cycles are expended in user mode on each word. Address
space limitations of early Suns forced us to reorganize the old objects by copying them to a
file, and modifying them in the file. Thus, every time a word is read from old space, a file
read subroutine is called. Current Suns and SOAR have 16 MB of address space, more than
enough to hold a copy of the 1 MB to 2 MB of old space. Replacing file read/write software
with virtual memory hardware should result in a large speed up, and a sub-minute reorganization
seems feasible.


Table 6.4: Measurements of an offline reorganization on BS.

user time               116.7 sec
system time             46.1 sec
real time               179 sec
idle time               16 sec
CPU utilization         90.9%
reads                   464
writes                  492
page faults             14
initial old size        243,036 words
final old size          231,207 words
bandwidth               480 µs/word
16-bit cycles/word      1200
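
The entries of Table 6.4 are mutually consistent, as a short check shows. The interpretation that the bandwidth figure is user time divided by the initial old-space size is an assumption; the table does not say how it was derived.

```python
INITIAL_OLD_WORDS = 243_036
MICROSECONDS_PER_WORD = 480
USER_S, SYSTEM_S, REAL_S = 116.7, 46.1, 179.0

# User time implied by the per-word cost (assumed to be measured over the initial size).
implied_user_s = INITIAL_OLD_WORDS * MICROSECONDS_PER_WORD * 1e-6

# CPU utilization as (user + system) / real.
cpu_utilization = (USER_S + SYSTEM_S) / REAL_S
```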


6.7. Summary


Objects that live long enough to be promoted but die shortly thereafter can present a
problem for Generation Scavenging. To study this phenomenon, we would need data from
sessions on high-performance systems using Generation Scavenging. Since we do not have
the capability to perform these experiments, we have merely explored some solutions that
can be adopted if necessary. The simplest strategy would be to set the tenuring threshold at a good
compromise between time and space efficiency. If that did not suffice it might be necessary
to add one or two more generations.




Chapter 7 


Conclusions 


7.1. Conclusions 

We have presented and evaluated the hardware and software design of Smalltalk On A
RISC (SOAR). We undertook this effort to see how well the reduced instruction set computer
style of system design would work for a software environment heretofore supported
only by complicated virtual machines. It has worked very well indeed. A combination of
hardware and software strategies has allowed us to build a single-chip NMOS microprocessor
that will match the performance of an ECL minicomputer, despite a 5:1 cycle time handicap.
With about half of the transistors of the MC68010 microprocessor, a 400 ns SOAR
will run the Smalltalk-80 system 25 times faster than the 400 ns MC68010. With only one
fifth of the transistors of the MC68020, and with a handicap of about a factor of two in cycle
time, SOAR will outrun the MC68020. RISCs pay off for experimental programming
environments.

SOAR’s performance comes at a price; namely, memory space. A bytecoded 32-bit 
Smalltalk-80 image occupies a megabyte of memory. Generation Scavenging adds 200 Kb 
to this, and compiling to a simple instruction set costs another 500 Kb. With current 
hardware technology, the extra 700 Kb is a small price to pay for high speed. 

The most important hardware features are register windows and tagged integer instructions.
These two features nearly double SOAR's performance by reducing the cost of subroutine
calls and type-checked integer operations. Other important hardware features
include byte insert/extract instructions, two-tone instructions, forwarding, one cycle jumps
and calls, and tagged immediate data. In the realm of software, our storage management
strategies (discussed below), direct pointers, in-line caching, and compiling to a simple
instruction set are essential. In addition to permitting fast instruction decoding, the simplicity
of the base architecture enables us to add the language-specific extensions.

On the other hand, despite our best intentions, we included several superfluous features
in SOAR, including hardware support for storage reclamation, pointers to registers, parallel
nilling, and shadow registers to aid trap handling. These are architect's traps because they
increase design time and potentially increase the cycle time without appreciably reducing the
number of cycles. These traps are baited with speedups for specific operations, and sprung
when real programs fail to perform the optimized operations.

We believe that the key to good performance is a willingness to migrate functionality
from one level of abstraction to another, viewing the system as a whole rather than as a collection
of layers. During the design process, we moved functions freely up and down the
implementation hierarchy from software to silicon to achieve good performance with
minimal hardware. For example, instead of interpretation, we have chosen to burden the
software with compiling and debugging a simple instruction set that can be executed
quickly. Also, we have replaced microcoded instructions for infrequent operations with
software trap handlers. Our system was designed with an implementation technology in
mind; this is the opposite of separating the architecture from the hardware implementation.

We have developed an algorithm for automatic storage reclamation, Generation
Scavenging, that permits SOAR to be the first full-speed Smalltalk-80 system without an
object table. We have shown that, unlike many competing algorithms, Generation Scavenging
requires no hardware support. In addition, this algorithm reduces the time spent on
storage reclamation to 3% of the CPU time. This is three times better than other
Smalltalk-80 systems with comparable performance. Finally, unlike traditional
reference-counting algorithms, Generation Scavenging can reclaim circular structures of
dead objects. Automatic storage reclamation is no longer an important source of overhead.

SOAR represents a substantial improvement in cost-performance over previous
Smalltalk-80 systems. We recommend that anyone faced with the task of building a computer
for an exploratory programming environment consider compilation to a reduced
instruction set.

7.2. Future Work

At this date SOAR has been fabricated and, running at 800 ns, has successfully completed
all of its diagnostics [Pen85b]. An unforeseen critical path to memory needed by the
fast shuffle hardware has increased its cycle time from 400 ns to 510 ns. Samples has ported
the Smalltalk-80 system to the SOAR simulator; the system starts up and displays its windows
on the screen. Our goal is to run the Smalltalk-80 system on SOAR. We will then
measure the performance of the system to find any flaws lurking in our performance data.
One of the most interesting remaining tasks is to construct a debugger for SOAR that provides
all the functionality of the current Smalltalk-80 bytecode debugger. A Smalltalk-80
system running on SOAR with complete, source-level debugging facilities would demonstrate
that the primitive level of the instruction set can be hidden from the user. Finally,
Pendleton has proposed reimplementing a stripped-down SOAR with an optimized pipeline
in a more advanced VLSI technology to yield a very fast Smalltalk-80 system.

One aspect of Generation Scavenging remains in dire need of exploration: objects with
an intermediate life span. If promoted too soon, they waste disk space and can degrade virtual
memory performance. If promoted too late, they waste the CPU time needed to repeatedly
scavenge them. Adding a third, middle generation is a possibility. Further research
will require measurements of high-performance Smalltalk-80 systems with real users to
obtain realistic actuarial data.



Appendix A


Detailed Performance Evaluation of Individual Features 


A.1. Introduction 

This appendix contains detailed evaluations of the effectiveness of most of the features
in SOAR and a few proposed additions to SOAR. The raw data, instruction mixes, and execution
time profiles on which these calculations are based are in Appendix B. To guide you
through this section, we have reprinted part of the table of contents in Table A.1. There are
two kinds of subroutines in SOAR: subroutines written by Xerox in Smalltalk, and subroutines
written by us in assembler for runtime support. Since these are written in two different
languages, they may have different instruction mixes. For this reason, our tables of dynamic
data have three columns: one for the routines written in Smalltalk (ST), one for the routines
written in assembler (system), and one that ignores the distinction (both). Since system code
consumes two-thirds of the time, the averages (used in the other chapters) tend to be dominated
by the behavior of the system code. If this code were optimized, the numbers for
Smalltalk code would become more important for overall performance. For static measurements,
the Smalltalk routines dwarf the assembler routines, and we usually omit the assembler
ones.

A.2. Runtime Type Checking

Runtime type checking distinguishes Smalltalk-80 systems from those designed for
conventional languages. SOAR supports this with a tag bit for integers and tagged integer
arithmetic and comparison instructions.

A.2.1. How Important Are the Tagged Integer Instructions?

To support tagged integers, SOAR includes tagged versions of the arithmetic and comparison
instructions. To assess their importance, we first measure their frequency of use,
then calculate the performance degradation that would be caused by replacing them by
equivalent software instructions.

A.2.1.1. Tagged Instruction Frequency

Table A.2 lists the frequency of each tagged integer instruction for several benchmarks.
Zero rows have been omitted. Table A.2 shows that, for compiled Smalltalk-80
code, one out of every 8 instructions executed exploits SOAR's integer tag-checking
hardware. Overall, the ratio is about 1 out of every 11 instructions. Interestingly, tagged
skips outnumber tagged arithmetic in compiled code.

Another way to measure frequency is to count the static number of each kind of tagged
instruction. Table A.3 shows that nearly 1 out of every 11 instructions is a tagged integer
instruction. This is slightly lower than the dynamic frequency of 1 in 8.

How often does SOAR detect an integer tag trap? As Table A.4 shows, these traps are
quite rare; fewer than 4 in 1,000 tagged instructions trap.

A.2.1.2. Cost of Omitting Tagged Arithmetic Instructions

How much slower would SOAR be without integer tag checking hardware? Table A.5
shows the sequences that would be needed without it, under the assumption that no compiler
optimization is performed. (The feasibility of such optimization in the absence of type
declarations has yet to be demonstrated.) Table A.6 summarizes these data with cost figures.
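
As an illustration of what such a software sequence has to do, the following Python sketch performs one tagged add with explicit checks. It is not SOAR's actual code sequence (Table A.5 gives those). The word layout assumed here (a 32-bit word whose high bit is 0 for an integer, leaving a 31-bit two's-complement value) and all names (`TAG_BIT`, `TagTrap`, `encode`, `decode`, `tagged_add`) are assumptions made for the example.

```python
TAG_BIT = 1 << 31                              # assumed: set for non-integers
INT_MIN, INT_MAX = -(1 << 30), (1 << 30) - 1   # 31-bit two's-complement range


class TagTrap(Exception):
    """Raised where tag-checking hardware would trap (hypothetical name)."""


def encode(n):
    """Pack a Python int into an integer-tagged 32-bit word."""
    if not INT_MIN <= n <= INT_MAX:
        raise TagTrap("value does not fit in 31 bits")
    return n & (TAG_BIT - 1)


def decode(word):
    """Unpack an integer-tagged word, sign-extending from bit 30."""
    if word & TAG_BIT:
        raise TagTrap("not an integer")
    return word - (1 << 31) if word & (1 << 30) else word


def tagged_add(a, b):
    """Add two tagged words, checking tags and overflow in software."""
    if (a | b) & TAG_BIT:                      # one OR tests both tags at once
        raise TagTrap("operand is not an integer")
    total = decode(a) + decode(b)
    if not INT_MIN <= total <= INT_MAX:        # result no longer fits in 31 bits
        raise TagTrap("integer overflow")
    return encode(total)
```

Every path through `tagged_add` pays for the tag test and the overflow test even though, by the measurements above, almost no tagged instruction actually traps; that is the cost the hardware removes.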

The next step is to combine this cost data with the frequency data. Table A.7 lists the 
time cost of omitting each type of tagged instruction from SOAR. The benchmarks would 
take from 20% to 32% more time without integer tag checking hardware in SOAR. 
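
The "performance relative to full SOAR" rows of Table A.7 follow from the extra-time totals by a simple reciprocal. The sketch below shows that conversion; the function name is ours, and the sample inputs are the test3plus4 totals from the "both" column of Table A.7.

```python
def relative_performance(extra_time_fraction):
    """Speed relative to full SOAR when run time grows by the given fraction.

    A program that takes (1 + x) times as long runs at 1 / (1 + x) of the
    original speed.
    """
    return 1.0 / (1.0 + extra_time_fraction)


# test3plus4, "both" column: 94.76% to 187.74% more time without tag hardware.
best_case = relative_performance(0.9476)
worst_case = relative_performance(1.8774)
print(f"{best_case:.0%} to {worst_case:.0%} of full SOAR speed")
```

The same conversion applied to the 20% and 32% figures quoted above gives a machine running at roughly five sixths to three quarters of full SOAR speed.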





Table A.2: Frequency of tagged arithmetic instructions, Part 1.

                      ST        system    both

test3plus4
  all insts           65.14%    34.86%    100%
  add                 33.07%    0.00%     21.54%
  trap1               0.00%     6.17%     2.15%
  loadc               3.35%     0.06%     2.20%
  total               36.42%    6.25%     25.89%

testActivationReturn
  all insts           97.21%    2.79%     100%
  sub                 9.46%     0.00%     9.20%
  skip                9.46%     0.00%     9.20%
  loadc               9.46%     0.00%     9.20%
  total               28.40%    0.00%     27.61%

testClassOrganizer
  all insts           41.06%    58.94%    100%
  add                 1.19%     1.19%     1.19%
  sub                 0.34%     1.73%     1.15%
  sll                 0.00%     0.59%     0.35%
  skip                2.26%     1.31%     1.70%
  trap1               0.00%     2.49%     1.47%
  load                0.00%     0.81%     0.81%
  loadc               7.23%     0.10%     3.03%
  total               11.03%    8.79%     9.71%

testCompiler
  all insts           33.42%    66.58%    100%
  add                 1.26%     0.89%     1.01%
  sub                 0.45%     1.17%     0.93%
  sll                 0.00%     0.29%     0.19%
  skip                1.94%     0.87%     1.23%
  trap1               0.00%     1.56%     1.04%
  load                0.00%     1.02%     0.68%
  loadc               7.30%     0.26%     2.60%
  total               10.92%    6.07%     7.69%

Table A.2: Frequency of tagged arithmetic instructions, Part 2.

                      ST        system    both

testDecompiler
  all insts           32.19%    67.81%    100%
  add                 1.83%     1.00%
  sub                 0.47%     1.17%     0.93%
  and                 0.09%     0.00%     0.03%
  sll                 0.00%     0.10%     0.07%
  sra                 0.00%     0.16%     0.11%
  skip                2.52%     0.62%     1.23%
  trap1               0.00%     1.56%     1.06%
  load                0.00%     1.12%     0.76%
  loadc               7.21%     0.28%     2.51%
  total               12.08%    6.00%     7.95%

testPrintDefinition
  all insts           38.01%    61.99%    100%
  add                 2.26%     1.37%     1.71%
  sub                 0.08%     2.69%     1.70%
  skip                4.31%     0.02%     1.65%
  trap1               0.00%     3.68%     2.28%
  load                0.00%     2.56%     1.59%
  loadc               7.97%     0.11%     3.10%
  total               14.65%    10.44%    12.04%

testPrintHierarchy
  all insts           26.25%    73.75%    100%
  add                 2.10%     0.26%     0.73%
  sub                 0.23%     0.84%     0.68%
  skip                2.51%     0.05%     0.70%
  trap1               0.00%     2.17%     1.60%
  load                0.00%     1.45%     1.07%
  loadc               7.62%     0.19%     2.14%
  total               12.46%    4.98%     6.94%

Average of macro-benchmarks
  all insts           34.19%    65.81%    100%
  add                 1.73%     0.94%     1.18%
  sub                 0.31%     1.52%     1.08%
  and                 0.02%     0.00%     0.01%
  sll                 0.00%     0.20%     0.12%
  sra                 0.00%     0.03%     0.02%
  skip                2.71%     0.57%     1.30%
  trap1               0.00%     2.29%     1.49%
  load                0.00%     1.39%     0.98%
  loadc               7.47%     0.19%     2.68%
  total               12.23%    7.26%     8.87%


Table A.5: Workaround for tagged instructions, Part 1.

add A sub _ 

%or a, b, t; (omit for immediate) 

fekip In t. 1« 31 
jump error 
fedd/%aub a, b, c 
feor a. b, t 
feadt, 1 «31,t 

Map n» t, 0; (are eigne equal?) 

jump ok: (do! ia OK) 

%xor a, c, t 
fend 1.1« 31, t 

fekip eq t, 0; (overflow?) 


aad A or A xor 

%or 

%albp tru a, 1« 31 
jump «nor 
%*od/%o&%xor 

JU_ 

%akip Mu a, 1« 31 
jump nrar 
fell a. b, 
feor a, b, t 
feadt. 1« 31. t 
fekip eq t, 0; 

^ jump e rror 

_«1_ 

%akip Mu a, 1« 31 


fekip Mb a, 1« 31 


%an a. b 

fekip M a. 1« 30 
ferb, 1 « 30, b 


(overflow?) 


a,b,t;(oioaly) 


■ 

I 


(overflow?) 



Table A.7: Time cost of omitting tagged integer instructions, Part 1.

                ST                  system              both

test3plus4
  all cycles    59.57%              40.43%              100%
  add           150.06%-300.12%     0.00%               89.40%-178.80%
  trap1         0.00%               13.26%-22.11%       5.36%-8.94%
  loadc         6.06%               0.10%               3.65%-3.65%
  total         150.06%-300.12%     13.36%-22.21%       94.76%-187.74%
  Performance relative to full SOAR (<100% is slower):  51%-35%

testActivationReturn
  all cycles    95.91%              4.09%               100%
  sub           35.30%-70.65%       0.00%               33.87%-67.75%
  skip          21.19%-35.31%       0.00%               20.32%-33.87%
  loadc         14.13%              0.00%               13.55%
  total         70.62%-120.08%      0.00%               67.74%-115.17%
  Performance relative to full SOAR (<100% is slower):  60%-46%

testClassOrganizer
  all cycles    42.56%              57.44%              100%
  add           3.99%-7.98%         4.27%-8.54%         4.15%-8.30%
  sub           1.13%-2.26%         6.19%-12.38%        4.04%-8.08%
  sll           0.00%               2.59%               1.49%
  skip          4.61%-7.68%         2.80%-4.67%         3.57%-5.95%
  trap1         0.00%               5.40%-8.98%         3.10%-5.16%
  load          0.00%               1.98%-2.98%         1.14%-1.71%
  loadc         9.80%               0.14%               4.25%-4.25%
  total         19.54%-27.72%       23.38%-40.20%       21.74%-34.95%
  Performance relative to full SOAR (<100% is slower):  82%-74%

testCompiler
  all cycles    34.07%              65.93%              100%
  add           4.18%-8.35%         3.05%-6.11%         3.44%-6.87%
  sub           1.52%-3.05%         4.06%-8.12%         3.20%-6.39%
  and           0.03%-0.03%         0.00%-0.00%         0.01%-0.01%
  sll           0.00%               1.17%               0.77%
  sra                                                   0.01%
  skip          3.90%-6.49%         1.82%-3.02%         2.52%-4.20%
  trap1         0.00%               3.22%-5.37%         2.12%-3.54%
  load          0.00%               1.41%-2.12%         0.93%-1.40%
  loadc         9.77%               0.35%               3.56%-3.56%
  total         19.35%-27.65%       15.10%-26.28%       16.55%-26.74%
  Performance relative to full SOAR (<100% is slower):  86%-79%

Table A.7: Time cost of omitting tagged integer instructions, Part 2.

                ST                  system              both

testDecompiler
  all cycles    32.38%              67.62%              100%
  add           6.29%-12.58%        3.42%-6.85%         4.35%-8.70%
  sub           1.55%-3.09%         4.00%-8.00%         3.20%-6.41%
  and           0.09%-0.15%         0.00%               0.03%-0.05%
  sll           0.00%               0.40%               0.27%
  sra           0.00%               0.43%               0.29%
  skip          5.13%-8.52%         1.29%-2.13%         2.53%-4.21%
  trap1         0.00%               3.22%-5.37%         2.18%-3.63%
  load          0.00%               1.54%-2.29%         1.04%-1.55%
  loadc         9.82%               0.40%               3.44%-3.44%
  total         22.86%-34.16%       14.68%-25.88%       17.34%-28.56%
  Performance relative to full SOAR (<100% is slower):  85%-78%

testPrintDefinition
  all cycles    38.09%              61.91%              100%
  add           8.30%-16.61%        5.01%-10.02%        6.26%-12.53%
  sub           0.25%-0.50%         9.89%-19.78%        6.22%-12.44%
  skip          9.45%-15.78%        0.03%-0.05%         3.62%-6.04%
  trap1         0.00%               8.09%-13.49%        5.01%-8.35%
  load          0.00%               3.78%-5.65%         2.34%-3.50%
  loadc         11.66%              0.16%               4.55%-4.55%
  total         29.69%-44.55%       26.95%-49.16%       27.99%-47.40%
  Performance relative to full SOAR (<100% is slower):  78%-68%

testPrintHierarchy
  all cycles    25.90%              74.10%              100%
  add           7.42%-14.85%        0.89%-1.78%         2.58%-5.16%
  sub           0.82%-1.65%         2.95%-5.89%         2.40%-4.79%
  and           0.04%               0.00%               0.01%
  sll           0.00%               0.03%               0.02%
  skip          5.37%-8.96%         0.12%-0.20%         1.48%-2.47%
  trap1         0.00%               4.56%-7.60%         3.38%-5.63%
  load          0.00%               2.04%-3.06%         1.51%-2.27%
  loadc         10.89%              0.27%               3.02%-3.02%
  total         24.52%-36.34%       10.84%-18.81%       14.38%-23.36%
  Performance relative to full SOAR (<100% is slower):  87%-81%

Table A.7: Time cost of omitting tagged integer instructions, Part 3.

                ST                  system              both

average of macro-benchmarks
  all cycles    34.60%              65.40%              100%
  add           6.04%-12.07%        3.33%-6.65%         4.15%-8.31%
  sub           1.05%-2.11%         5.42%-10.84%        3.81%-7.62%
  and           0.03%-0.04%         0.00%               0.01%-0.02%
  sll           0.00%               0.84%               0.51%
  sra           0.00%               0.09%               0.06%
  skip          5.69%-9.49%         1.21%-2.01%         2.74%-4.57%
  trap1         0%                  4.9%-8.16%          3.16%-5.26%
  load          0.00%               2.15%-3.22%         1.39%-2.09%
  loadc         10.39%              0.26%               3.76%
  total         23.19%-34.09%       18.19%-32.08%       19.61%-32.21%
  Performance relative to full SOAR (<100% is slower):  84%-76%


Of course, eliminating tag checking hardware from SOAR would also incur a space
cost for the extra checking instructions. Table A.8 combines the static cost data with the
static frequency data to compute the code expansion resulting from omitting data tag checking
hardware in SOAR. Again, we can ignore the system code because it is so small. The
data show that 38% more instructions would be needed, about 15% of the total image.
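
The two percentage columns of the static cost table are plain ratios against the sizes listed in its header. The sketch below reproduces that arithmetic; the constant and function names are ours, and the word counts are the ones printed in the table header.

```python
CODE_WORDS = 168_581    # SOAR words of compiled code and literals
IMAGE_WORDS = 430_000   # SOAR words in the total image


def static_cost_shares(extra_words):
    """Return (share of code, share of code + data) for a given word cost."""
    return extra_words / CODE_WORDS, extra_words / IMAGE_WORDS


# The "and, not immediate" row costs 396 words.
code_share, image_share = static_cost_shares(396)
print(f"{code_share:.2%} of code, {image_share:.2%} of the image")

# 38% more instructions, expressed as a share of the whole image.
extra_words = 0.38 * CODE_WORDS
print(f"{extra_words / IMAGE_WORDS:.0%} of the image")
```

Because compiled code is well under half of the image, a 38% growth in code translates into only about 15% growth overall.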


Table A.8: Static Cost of Omitting Tagged Arith Insts in System.

(3502 instruction words)
(493 data words)
(3995 total words in sys)
(168,581 SOAR words of compiled code & literals)
(4,600 Smalltalk subroutines)
(430,000 SOAR words total image)

          immediate?   cost     %code    %code + data
  and     yes          120      0.07%    0.03%
  and     no           396      0.23%    0.09%
  or      yes          4        0.00%    0.00%
  or      no           66       0.04%    0.02%
  add     yes          7462
  add     no           11320
  sub     yes          4606
  sub     no           8680

By moving the tag check into hardware we have increased the cost for a tag exception.
SOAR must take a trap to handle one. The data show that only 0.39% of tagged instructions
trap, and that only 12.2% of the instructions are tagged. Thus, a tag trap occurs once for
every 2000 instructions. Since the tag trap handler prologue is about 25 instructions long,
this represents a time cost of about 1.25%.
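
The estimate above is a product of two small fractions followed by a division. The sketch below spells it out; the function name is ours, and the tagged-instruction fraction is taken as roughly 12% (about 1 instruction in 8, the dynamic figure for compiled Smalltalk code).

```python
def tag_trap_overhead(trap_rate, tagged_fraction, handler_instructions):
    """Return (instructions between traps, fraction of time spent in handlers).

    trap_rate            -- fraction of tagged instructions that trap
    tagged_fraction      -- fraction of all instructions that are tagged
    handler_instructions -- length of the trap handler prologue
    """
    traps_per_instruction = trap_rate * tagged_fraction
    interval = 1.0 / traps_per_instruction
    overhead = traps_per_instruction * handler_instructions
    return interval, overhead


interval, overhead = tag_trap_overhead(0.0039, 0.12, 25)
print(f"one trap per {interval:.0f} instructions, {overhead:.2%} overhead")
```

With these inputs the interval comes out slightly above 2000 instructions; rounding it to 2000, as the text does, gives 25/2000 = 1.25%.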

To summarize, SOAR without hardware support for integer tag checking and with the 
same code generation strategy would run 24% slower and require about 150 KB more 
memory. 


A.2.2. Evaluating the Impact of Adding a Compare-and-Branch Instruction 

Instead of condition codes, SOAR uses conditional skip instructions. This simplifies
handling comparisons of data that are not integers. The tag trap handler need not set condition
codes, but can merely return to the appropriate location. As a result, a conditional jump
in SOAR takes two cycles: one for the skip instruction and another for the jump. This is as
fast as it can be without an additional adder to compute jump addresses. If we had such a
device, how much faster could SOAR run? To bound the number of times a conditional
jump instruction would be used, we can count skips. We can find a more accurate figure by
counting only those skips that skip over unconditional jumps. Table A.9 presents these data.
The table shows that the most that could be hoped for is an 8% improvement. Counting only
those skips that follow jumps results in a time savings of 2.6%. The large disparity implies
that there are many places where the conditionally executed code is only a single instruction.
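
The reasoning behind both figures is that fusing a skip with the jump it guards turns a two-cycle pair into one cycle, so each fused pair saves exactly one cycle. A minimal sketch follows; the function name is ours and the counts are illustrative values chosen to match the percentages quoted above, not measured data.

```python
def cycles_saved_fraction(total_cycles, fused_pairs):
    """Fraction of run time saved if each fused skip+jump pair saves one cycle."""
    return fused_pairs / total_cycles


TOTAL_CYCLES = 1000      # illustrative run length
ALL_SKIPS = 80           # every skip counted: the optimistic upper bound
SKIPS_OVER_JUMPS = 26    # only skips that guard an unconditional jump

print(f"upper bound: {cycles_saved_fraction(TOTAL_CYCLES, ALL_SKIPS):.1%}")
print(f"estimate:    {cycles_saved_fraction(TOTAL_CYCLES, SKIPS_OVER_JUMPS):.1%}")
```

The gap between the two results is the share of skips that guard something other than a jump, for which a compare-and-branch instruction would save nothing.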

For a static analysis, we counted the number of conditional jump sequences produced
by the compiler (Table A.10). The table shows that little space would be saved.
Table A.9: Upper bound on speedup with compare-and-branch, Part 1.

                                   ST        system    both

testClassOrganizer
  instructions                     41.06%    58.94%    100%
  cycles                           42.56%    57.44%    100%
  untagged skips per instruction   1.57%     12.39%    7.95%
  tagged skips per instruction     2.27%     1.30%     1.70%
  total skips per instruction      3.84%     13.69%    9.65%
  skip-jumps per instruction       1.06%     5.49%     3.67%
  untagged skips per cycle         1.06%     8.91%     5.57%
  tagged skips per cycle           1.53%     0.93%     1.19%
  total skips per cycle            2.60%     9.84%     6.76%
  skip-jumps per cycle             0.85%     4.43%     2.95%

testCompiler
  instructions                     33.42%    66.58%    100%
  cycles                           34.07%    65.93%    100%
  untagged skips per instruction   1.50%     15.57%    10.87%
  tagged skips per instruction     1.93%     0.88%     1.23%
  total skips per instruction      3.44%     16.44%    12.10%
  skip-jumps per instruction       1.37%     5.78%     4.30%
  untagged skips per cycle         1.01%     10.74%    7.42%
  tagged skips per cycle           1.30%     0.60%     0.84%
  total skips per cycle            2.30%     11.34%    8.26%
  skip-jumps per cycle             0.92%     3.98%     2.94%

testDecompiler
  instructions                     32.19%    67.81%    100%
  cycles                           32.38%    67.62%    100%
  untagged skips per instruction   0.72%     17.56%    12.14%
  tagged skips per instruction     2.51%     0.62%     1.23%
  total skips per instruction      3.23%     18.18%    13.37%
  skip-jumps per instruction       1.29%     4.63%     3.56%
  untagged skips per cycle         0.49%     12.07%    8.32%
  tagged skips per cycle           1.71%     0.43%     0.84%
  total skips per cycle            2.20%     12.50%    9.16%
  skip-jumps per cycle             0.88%     3.18%     2.44%

A.2.3. Evaluating Two-Tone Instructions

SOAR has two modes of execution: tagged and untagged. Rather than putting a mode
bit in the PSW and spending a cycle to switch modes when needed, we put a mode bit in
each instruction. Table A.11 shows how much slower SOAR would run if it took extra time
to switch modes. The table shows that SOAR would be 16% slower without two-tone
instructions.

To compute the code expansion, we instrumented the compiler. Table A.12 analyzes
these data. The table shows that the image would be 19% larger without two-tone instructions.
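
The relative cost reported in Table A.12 can be checked by converting the count of extra mode-setting instructions to bytes. The sketch below assumes that each SOAR instruction occupies one 32-bit word (4 bytes) and that "kB" means 1000 bytes; both are our assumptions, and the names are ours.

```python
BYTES_PER_INSTRUCTION = 4        # assumed: one 32-bit word per instruction
IMAGE_BYTES = 1_500 * 1000       # 1,500 kB image, assuming 1 kB = 1000 bytes


def relative_space_cost(extra_instructions):
    """Growth of the image if the given number of instructions is added."""
    return extra_instructions * BYTES_PER_INSTRUCTION / IMAGE_BYTES


print(f"{relative_space_cost(70759):.2%}")
```

Under these assumptions the 70759 extra instructions come to about 283 kB, which is the 19% growth quoted above.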


Table A.11: Projected time cost of manipulating PSW mode bit.

                                       ST        system    both

testClassOrganizer
  cycles                               42.56%    57.44%    100%
  cost of mode-setting instructions    17.86%    19.30%    18.69%

testCompiler
  cycles                               34.07%    65.93%    100%
  cost of mode-setting instructions    18.52%    12.68%    14.67%

testDecompiler
  cycles                               32.38%    67.62%    100%
  cost of mode-setting instructions    19.87%    11.92%    14.50%

testPrintDefinition
  cycles                               38.09%    61.91%    100%
  cost of mode-setting instructions    20.53%    20.35%    20.42%

testPrintHierarchy
  cycles                               25.90%    74.10%    100%
  cost of mode-setting instructions    21.74%    9.93%     12.99%

average of macro-benchmarks
  cycles                               34.60%    65.40%    100.00%
  cost of mode-setting instructions    19.70%    14.84%    16.25%


Table A.12: Space cost of mode bit in PSW.

  number of extra instructions to change PSW mode bit    70759
  image size                                             1,500 kB
  relative cost of PSW mode bit                          18.87%
A.2.4. How Important Are Tagged Immediates?

SOAR's tagged immediate format crams tagged values such as nil, true, and false into
a twelve-bit immediate field. Without this feature, a two-cycle load instruction would be
needed to get a tagged value. Table A.13 analyzes the performance impact of this feature.
For each benchmark, it gives the breakdown of cycles spent in Smalltalk vs. system code,
then proceeds to give the percentage of immediates used requiring the tagged format, and
finally, the time cost of omitting this feature. These data suggest that SOAR would be 10%
slower without this feature.

To analyze the impact of tagged immediates on the size of the compiled image, we
instrumented our compiler (Table A.14). As expected, non-negative integers dominate
immediate values. Pointer immediates are also frequent. Interestingly, boolean masks (all
zeroes with a one in one of the top four bits, or tag values) provide a use for tagged immediates
more often than pointers.

The next step is to count the number of immediates that would be unrepresentable
without tagged immediates and determine the amount of further expansion in the image
(Table A.15). Tagged immediates don't save much space; the image would only be 1.2%
larger without them.
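
The net cost in Table A.15 is the difference between the immediates that would be lost and those that would be gained, using the counts in Table A.14. The sketch below walks through that arithmetic. It assumes that each unrepresentable immediate costs one extra 32-bit word (4 bytes) and that the image is 1,500,000 bytes; the names and these two assumptions are ours.

```python
BOOLEAN_MASKS = 2984       # representable only with tagged immediates
POINTERS = 2433            # representable only with tagged immediates
INVALID_INTEGERS = 868     # would fit if the tag bits were freed

BYTES_PER_WORD = 4         # assumed cost per unrepresentable immediate
IMAGE_BYTES = 1_500_000    # total SOAR image size, assuming 1 kB = 1000 bytes

cost = BOOLEAN_MASKS + POINTERS          # immediates that no longer fit
net_cost = cost - INVALID_INTEGERS       # minus integers that now fit
relative_cost = net_cost * BYTES_PER_WORD / IMAGE_BYTES

print(cost, net_cost, f"{relative_cost:.2%}")
```

The small result reflects how few immediates are affected compared with the size of the image as a whole.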


A.3. Interpretation

This section concerns features of SOAR's instruction set and trap system.


A.3.1. Evaluating SOAR's Byte Facilities

We perform two comparisons: the speedup possible with load/store byte instructions,
and the slowdown had we not provided the insert and extract instructions. Table A.16 gives
the important instruction sequences: loadByte and storeByte are slightly faster than extract
and insert, which in turn are much faster than relying on one bit shifts.
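
The cycle counts in Table A.16 can be reproduced from a very small cost model. The sketch below is our own model, not part of the SOAR tools: it assumes a load takes two cycles (as stated earlier for the load instruction), a store takes two cycles, every other instruction takes one cycle, and each shift moves a single bit, so reaching byte n costs 8n shifts.

```python
LOAD = 2     # cycles per load
STORE = 2    # cycles per store (assumed; chosen to match Table A.16)
ALU = 1      # cycles per shift, and, or, xor, insert, extract (assumed)


def load_byte_cycles(byte_no, scheme):
    """Cycles to fetch one byte under each scheme of Table A.16."""
    if scheme == "loadByte":
        return LOAD
    if scheme == "extract":
        return LOAD + ALU                      # load word, extract byte
    if scheme == "shifts":
        cycles = LOAD + 8 * byte_no * ALU      # load word, one-bit shifts
        if byte_no != 3:
            cycles += LOAD + ALU               # load mask, and
        return cycles
    raise ValueError(scheme)


def store_byte_cycles(byte_no, scheme):
    """Cycles to store one byte under each scheme of Table A.16."""
    if scheme == "storeByte":
        return STORE
    if scheme == "insert":
        return 2 * LOAD + 3 * ALU + STORE      # load, load mask, and, insert, or, store
    if scheme == "shifts":
        cycles = 2 * LOAD + ALU + 8 * byte_no * ALU + ALU + STORE
        if byte_no != 3:
            cycles += 2 * ALU                  # xor and and on the source
        return cycles
    raise ValueError(scheme)


print([load_byte_cycles(n, "shifts") for n in range(4)])
print([store_byte_cycles(n, "shifts") for n in range(4)])
```

The shift-only sequences are dominated by the 8n one-bit shifts, which is why their cost grows so quickly with the byte number while the extract and insert sequences stay constant.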















Table A.13: Performance impact of tagged immediates (only fragments of this table survive).

                                ST        system    both

testDecompiler
  cycles                        32.38%
  tagged imms/all imms          12.74%
  tagged imm cost/all cycles    6.12%
  (unplaced value)              10.78%

testPrintDefinition
  cycles                        38.09%    61.91%    100%
  tagged imms/all imms          12.63%    10.29%    10.88%
  tagged imm cost/all cycles    5.90%     8.75%     7.66%

testPrintHierarchy
  tagged imms/all imms
  tagged imm cost/all cycles

average of macro-benchmarks
  cycles                                  65.40%    100.00%
  tagged imms/all imms
  tagged imm cost/all cycles

Table A.14: Raw data for static analysis of tagged immediates.

  immediate value             count    OK in SOAR    OK w/o tagged immediates
  non-negative integers       35106    yes           yes
  negative 31-bit integers    7968     yes           yes*
  boolean masks               2984     yes           no
  pointers                    2433     yes           no
  invalid† pointers           8507     no            no
  invalid† integers           868      no            yes*

  total SOAR image size       1500 kB


Table A.15: Impact of eliminating tagged immediates.

  cost for pointers           5417 immediates
  savings for integers        868 immediates
  net cost                    4549 immediates
  relative cost               1.21%


Table A.16: Code sequences for byte operations, Part 1.
(Byte 0 is least significant byte, byte 3 is most significant.)

Loading a byte from memory

load byte instruction (addition to SOAR)
    loadByte   (base)offset + byteNo, dest
    time       2 cycles

extract byte instruction (current SOAR)
    load       (base)offset, dest
    extract    dest, byteNo, dest
    time       3 cycles

no special instructions (simplification to SOAR)
    load       (base)offset, dest
    srl        dest, dest               (0 to 24 of these)
    load       pcRel(mask), maskReg     (omit for byte 3)
    and        dest, maskReg, dest      (omit for byte 3)
    mask:      0xff
    byte 0 time    5 cycles
    byte 1 time    13 cycles
    byte 2 time    21 cycles
    byte 3 time    26 cycles
    avg. time      16 cycles


* In order to be conservative, we assume that the negative immediates could be represented without tagged immediates
by either choosing the opcode to subtract instead of add or, for offsets, by using the full 32-bit representation. We further
assume that the integers which are too big for our current scheme would fit in four more bits.

† These values do not fit in SOAR's tagged immediate format.

Table A.16: Code sequences for byte operations, Part 2.
(Byte 0 is least significant byte, byte 3 is most significant.)

Storing a byte in memory

store byte instruction (addition to SOAR)
    storeByte  source, (base)offset + byteNo
    time       2 cycles

insert byte instruction (current SOAR)
    load       (base)offset, r1
    load       pcRel(mask), maskReg
    and        r1, maskReg, r1
    insert     source, byteNo, r2
    or         r1, r2, r1
    store      r1, (base)offset
    time       9 cycles

no special instructions (simplification of SOAR)
    load       (base)offset, r1
    load       pcRel(mask), maskReg
    and        r1, maskReg, r1
    sll        source, source
    xor        maskReg, -1, maskReg      (omit for byte 3)
    and        source, maskReg, source   (omit for byte 3)
    or         r1, source, r1
    store      r1, (base)offset
    byte 0 time    10 cycles
    byte 1 time    18 cycles
    byte 2 time    26 cycles
    byte 3 time    32 cycles
    avg. time      22 cycles


Next, in Table A.17, we gather frequency data on insert and extract instructions, and
multiply by the various costs to evaluate the performance impact of these other two schemes.
As shown in the last section of Table A.17, the average time savings for adding load/store
byte instructions would be 1%, while the average time penalty for taking away the byte
insert/extract instructions would be 33%. Byte insert/extract instructions seem to be a good
compromise between functionality and efficiency.
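
Each savings or omission-cost row of Table A.17 is a frequency multiplied by a difference in cycle counts from Table A.16. The sketch below redoes that multiplication for the testDecompiler benchmark ("both" column). The function name is ours, and the results differ from the printed rows in the last digit because the inputs are already rounded.

```python
def time_change(uses_per_cycle, old_cycles, new_cycles):
    """Change in run time, as a fraction of current cycles, when an operation
    that now takes old_cycles is replaced by one taking new_cycles."""
    return uses_per_cycle * (new_cycles - old_cycles)


# testDecompiler, "both" column of Table A.17.
EXTRACT_PER_CYCLE = 0.0129
INSERT_PER_CYCLE = 0.0052

# Adding loadByte/storeByte: extract 3 -> 2 cycles, insert sequence 9 -> 2.
load_byte_savings = -time_change(EXTRACT_PER_CYCLE, 3, 2)
store_byte_savings = -time_change(INSERT_PER_CYCLE, 9, 2)

# Removing extract/insert: 3 -> 16 cycles and 9 -> 22 cycles on average.
avg_extract_omission = time_change(EXTRACT_PER_CYCLE, 3, 16)
avg_insert_omission = time_change(INSERT_PER_CYCLE, 9, 22)

for name, value in [("load byte savings", load_byte_savings),
                    ("store byte savings", store_byte_savings),
                    ("avg extract omission cost", avg_extract_omission),
                    ("avg insert omission cost", avg_insert_omission)]:
    print(f"{name}: {value:.2%}")
```

The asymmetry is visible directly in the cycle differences: dedicated byte loads and stores save at most a few cycles per use, while falling back to one-bit shifts adds thirteen on average.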


Table A.17: Dynamic analysis of byte operations, Part 1.

                                      ST    system    both

testClassOrganizer
  steps                               41.06%  58.94%  100%
  cycles                              42.56%  57.44%  100%
  insert per inst                     0     0.97%     0.57%
  extract per inst                    0     3.54%     2.09%
  insert + extract per inst           0     4.51%     2.66%
  insert per cycle                    0     0.70%     0.40%
  extract per cycle                   0     2.54%     1.46%
  insert + extract per cycle          0     3.24%     1.86%
  store byte savings                  0     4.87%
  load byte savings                   0     2.54%     1.46%
  load & store byte savings           0     7.41%     4.26%
  min insert omission cost            0     0.70%
  min extract omission cost           0     5.09%     2.92%
  min insert/extract omission cost    0     5.78%     3.32%
  avg insert omission cost            0     9.04%     5.19%
  avg extract omission cost           0     33.07%    18.99%
  avg insert/extract omission cost    0     42.11%    24.19%
  max insert omission cost            0     16.00%    9.19%
  max extract omission cost           0     58.50%    33.60%
  max insert/extract omission cost    0     74.50%    42.79%

testCompiler
  steps                               33.42%  66.58%  100%
  cycles                              34.07%  65.93%  100%
  insert per inst                     0     0.75%     0.50%
  extract per inst                    0     2.62%     1.75%
  insert + extract per inst           0     3.37%     2.24%
  insert per cycle                    0     0.52%
  extract per cycle                   0     1.81%     1.19%
  insert + extract per cycle          0     2.32%     1.53%
  store byte savings                  0     3.61%     2.38%
  load byte savings                   0     1.81%     1.19%
  load & store byte savings           0     5.41%     3.57%
  min insert omission cost            0               0.34%
  min extract omission cost           0     3.62%     2.38%
  min insert/extract omission cost    0     4.13%     2.72%
  avg insert omission cost            0     6.70%     4.41%
  avg extract omission cost           0     23.51%    15.50%
  avg insert/extract omission cost    0     30.20%    19.91%
  max insert omission cost            0     11.85%    7.81%
  max extract omission cost           0     41.59%    27.42%
  max insert/extract omission cost    0     53.43%    35.23%
Table A.17: Dynamic analysis of byte operations, Part 2.

                                      ST    system    both

testDecompiler
  steps                               32.19%  67.81%  100%
  cycles                              32.38%  67.62%  100%
  insert per inst                     0     1.12%     0.76%
  extract per inst                    0     2.77%     1.88%
  insert + extract per inst           0     3.89%     2.64%
  insert per cycle                    0     0.77%     0.52%
  extract per cycle                   0     1.91%     1.29%
  insert + extract per cycle          0     2.67%     1.81%
  store byte savings                  0     5.37%     3.63%
  load byte savings                   0     1.91%     1.29%
  load & store byte savings           0     7.28%     4.92%
  min insert omission cost            0     0.77%     0.52%
  min extract omission cost           0     3.81%     2.58%
  min insert/extract omission cost    0     4.58%     3.10%
  avg insert omission cost            0     9.97%     6.74%
  avg extract omission cost           0     24.78%    16.76%
  avg insert/extract omission cost    0     34.75%    23.50%
  max insert omission cost            0     17.65%    11.93%
  max extract omission cost           0     43.84%    29.65%
  max insert/extract omission cost    0     61.49%    41.58%

testPrintDefinition
  steps                               38.01%  61.99%  100%
  cycles                              38.09%  61.91%  100%
  insert per inst                     0     2.23%     1.38%
  extract per inst                    0     6.03%     3.74%
  insert + extract per inst           0     8.26%     5.12%
  insert per cycle                    0     1.63%     1.01%
  extract per cycle                   0     4.42%     2.74%
  insert + extract per cycle          0     6.06%     3.75%
  store byte savings                  0     11.44%    7.08%
  load byte savings                   0     4.42%     2.74%
  load & store byte savings           0     15.86%    9.82%
  min insert omission cost            0     1.63%     1.01%
  min extract omission cost           0     8.85%     5.48%
  min insert/extract omission cost    0     10.48%    6.49%
  avg insert omission cost            0     21.24%    13.15%
  avg extract omission cost           0     57.51%    35.60%
  avg insert/extract omission cost    0     78.75%    48.75%
  max insert omission cost            0     37.57%    23.26%
  max extract omission cost           0     101.75%   62.99%
  max insert/extract omission cost    0     139.32%   86.25%
Table A.17: Dynamic analysis of byte operations, Part 3.


testPrintHierarchy


s 26.23% 73.75% 1009 


steps 

cycles 


insert per inst 
extract per inst 
insert + extract per inst 


insert per cycle 
extract per cycle 
insert + extract per cycle 


25.90% 


73.75% 

74.10% 


2.84% 

2.09% 

4.20% 

3.10% 

7.04% 

5.19% 


1.99% 

2.93% 

4.94% 


mm insert omission cost 
min extract omission cost 
min insert/extract omission cost 

avg insert omission cost 
avg extract omission cost 
avg insert/extract omission cost 

max insert omission cost 
max extract omission cost 
max insert/extract omission cost 


45.77% 

67.76% 

11334% 


10.32% 



% 


28.38% 

4735% 

33.92% 

50.21% 

84.13% 


average of macro-benchmarks 


insert per inst
extract per inst
insert + extract per inst


insert per cycle
extract per cycle
insert + extract per cycle


store byte savings
load byte savings
load & store byte savings


min insert omission cost
min extract omission cost
min insert/extract omission cost

avg insert omission cost
avg extract omission cost
avg insert/extract omission cost

max insert omission cost
max extract omission cost
max insert/extract omission cost


100 . 00 % 

100 . 00 % 


1.06% i 
231% i 


1.77% 

7.02% 


50.00% 

32.78% 

62.69% 

40.77% 

25.77% 

17.22% 

88.46% 

58.00% 








































































Table A.18: Loadc Time Analysis, Part 1. 
(All numbers are in percents.) 


benchmark                  Smalltalk    system      both

testActivationReturn
  steps                       97.21%     2.79%      100%
  cycles                      95.91%     4.09%      100%
  loadc per inst               9.47%     0.01%     9.20%
  loadc per cycle              7.06%     0.01%     6.77%
  loadc traps per loadc           0%        0%        0%
  cost of omitting loadc          0%        0%        0%

testClassOrganizer
  steps                       41.06%    58.94%      100%
  cycles                      42.56%    57.44%      100%
  loadc per inst               7.24%     0.10%     3.03%
  loadc per cycle              4.90%     0.07%     2.13%
  loadc traps per loadc       25.39%        0%    24.90%
  cost of omitting loadc       2.49%        0%     1.06%

testCompiler
  steps                       33.42%    66.58%      100%
  cycles                      34.07%    65.93%      100%
  loadc per inst               7.29%     0.25%     2.60%
  loadc per cycle              4.89%     0.17%     1.78%
  loadc traps per loadc       15.41%     1.38%    14.52%
  cost of omitting loadc       1.51%     0.00%     0.52%


testDecompiler
  steps                       32.19%    67.81%      100%
  cycles                      32.38%    67.62%      100%
  loadc per inst               7.20%     0.29%
  loadc per cycle              4.91%     0.20%
  loadc traps per loadc       17.06%     0.16%
  cost of omitting loadc       1.67%     0.00%

testPrintDefinition
  steps                       38.01%    61.99%      100%
  cycles                      38.09%    61.91%      100%
  loadc per inst
  loadc per cycle
  loadc traps per loadc
  cost of omitting loadc

testPrintHierarchy
  steps                       26.25%    73.75%      100%
  cycles                      25.90%    74.10%      100%
  loadc per inst
  loadc per cycle
  loadc traps per loadc
  cost of omitting loadc




























































































































100 % 

100 % 


0.03% 

0 . 02 % 


testDecompiler
  instructions                  32.19%    67.81%      100%
  cycles                        32.38%    67.62%      100%
  inl/outl uses per inst            0%     0.04%     0.03%
  cost of omitting inl/outl         0%     0.03%     0.02%

testPrintDefinition
  instructions                  38.01%    61.99%      100%
  cycles                        38.09%    61.91%      100%
  inl/outl uses per inst            0%        0%        0%
  cost of omitting inl/outl         0%        0%        0%

testPrintHierarchy
  instructions                  26.25%    73.75%      100%
  cycles                        25.90%    74.10%      100%
  inl/outl uses per inst            0%     0.00%     0.00%
  cost of omitting inl/outl         0%     0.00%     0.00%





























A.3.6. Evaluating SOAR's Conditional Trap Instruction

Conditional trap instructions can save one cycle for a comparison whose outcome can
be predicted. Our SOAR software exploits the trap instruction to verify the in-line
procedure call cache, to check the tags of return values, and to test the types of arguments to
primitive routines. Table A.22 shows the sequence that would be required without this
instruction. Table A.23 shows the dynamic frequency of trap instructions and the time cost of
omitting this feature from SOAR. Since the overhead is one cycle per trap instruction, the
difference between the two numbers arises because the average instruction duration is 1.3
cycles. The data show that SOAR would be 4% slower without this feature.
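
The relation between trap frequency (per instruction) and time cost (per cycle) can be sketched as a short calculation. This is an illustrative sketch, not the author's measurement tooling: the function name is hypothetical, and the 1.42 cycles-per-instruction figure is an assumption back-computed from the macro-benchmark averages in Table A.23 (5.59% of instructions, 3.93% of cycles).

```python
def trap_omission_cost(traps_per_inst: float,
                       extra_cycles_per_trap: float,
                       cycles_per_inst: float) -> float:
    """Fraction of additional cycles if every trap instruction costs
    `extra_cycles_per_trap` more cycles than it does now."""
    return traps_per_inst * extra_cycles_per_trap / cycles_per_inst


# Macro-benchmark average from Table A.23: 5.59% of instructions are traps.
# ASSUMPTION: 1.42 cycles per instruction, implied by 5.59% / 3.93%.
cost = trap_omission_cost(0.0559, 1.0, 1.42)
print(f"estimated slowdown without trap instructions: {cost:.2%}")
```

Dividing by the cycles-per-instruction figure converts a per-instruction frequency into a per-cycle cost, which is why the cost column is smaller than the frequency column.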

To analyze the impact of eliminating trap instructions on the size of the compiled
image, we instrumented our compiler to count trap instructions. Then, assuming that each
such instruction would become two instructions (a skip followed by a call), we can
calculate the total impact (Table A.24). Trap instructions improve image size even less than
execution speed, and our image would be only 2% larger without them.

A.3.7. One-Cycle Traps

At one point in the design of SOAR, we decided to extend the trap operation rather
than lengthen the cycle time [Pen85b]. This resulted in two-cycle traps instead of one-cycle
traps. How many cycles did this decision cost us? Table A.25 presents our data. The result
of adding the extra cycle to the trap operation was to require fewer than one percent more
cycles. This was a good decision.

Table A.22: Workaround for trap instruction.
skip 

Table A.23: Time cost of omitting the trap instruction.
(All numbers are percentages.) 

                                           ST    system      both

testActivationReturn
  instructions                         97.21%     2.79%      100%
  cycles                               95.91%     4.09%      100%
  trap instructions per instruction    14.20%     0.02%    13.80%
  cost w/o trap instruction            10.59%     0.01%    10.16%

testClassOrganizer
  instructions                         41.06%    58.94%      100%
  cycles                               42.56%    57.44%      100%
  trap instructions per instruction     9.43%     3.53%     5.99%
  cost w/o trap instruction             6.44%     2.54%     4.20%

testCompiler
  instructions                         33.42%    66.58%      100%
  cycles                               34.07%    65.93%      100%
  trap instructions per instruction     9.38%     2.35%     4.70%
  cost w/o trap instruction             6.28%     1.62%     3.21%

testDecompiler
  instructions                         32.19%    67.81%      100%
  cycles                               32.38%    67.62%      100%
  trap instructions per instruction     9.31%     2.51%     4.70%
  cost w/o trap instruction             6.35%     1.73%     3.22%

testPrintDefinition
  instructions                         38.01%    61.99%      100%
  cycles                               38.09%    61.91%      100%
  trap instructions per instruction     9.35%     5.64%     7.05%
  cost w/o trap instruction             6.83%     4.13%     5.16%

testPrintHierarchy
  instructions                         26.25%    73.75%      100%
  cycles                               25.90%    74.10%      100%
  trap instructions per instruction     9.07%     4.22%     5.49%
  cost w/o trap instruction             6.48%     2.96%     3.87%

average of macro-benchmarks
  instructions                         34.19%    65.81%   100.00%
  cycles                               34.60%    65.40%   100.00%
  trap instructions per instruction     9.33%     3.65%     5.59%
  cost w/o trap instruction             6.48%     2.60%     3.93%

Table A.25: Trap frequencies, Part 1.


                                             ST    system      both

classOrganizer
  cycles                                 42.56%    57.44%      100%
  TT's per cycle (tag traps)              1.53%     0.00%     0.65%
  WO's per cycle                          0.53%     0.05%     0.23%
  WU's per cycle                          0.43%     0.13%     0.18%
  TT's per cycle (trap instructions)      0.05%     0.00%     0.02%
  total traps per cycle                   2.54%     0.18%     1.08%

compiler
  cycles                                 34.07%    65.93%      100%
  TT's per cycle (tag traps)              0.91%     0.00%     0.31%
  WO's per cycle                          0.56%     0.09%     0.19%
  WU's per cycle                          0.51%     0.12%     0.17%
  TT's per cycle (trap instructions)      0.24%     0.01%     0.08%
  GS's per cycle                          0.00%     0.02%     0.00%
  total traps per cycle                   2.22%     0.24%     0.76%

decompiler
  cycles                                 32.38%    67.62%      100%
  TT's per cycle (tag traps)              0.92%     0.00%     0.30%
  WO's per cycle                          0.34%     0.08%     0.11%
  WU's per cycle                          0.37%     0.07%     0.12%
  TT's per cycle (trap instructions)      0.34%     0.00%     0.11%
  total traps per cycle                   1.98%     0.15%     0.64%

printDefinition
  cycles                                 38.09%    61.91%      100%
  TT's per cycle (tag traps)              0.76%     0.00%     0.29%
  WO's per cycle                          0.04%     0.02%     0.01%
  WU's per cycle                          0.05%     0.02%     0.02%
  TT's per cycle (trap instructions)      0.04%     0.00%     0.02%
  GS's per cycle                          0.01%     0.00%     0.00%
  total traps per cycle                   0.90%     0.03%     0.34%

printHierarchy
  cycles                                 25.90%    74.10%      100%
  TT's per cycle (tag traps)              0.28%     0.00%     0.07%
  WO's per cycle                          0.38%     0.03%     0.10%
  WU's per cycle                          0.27%     0.07%     0.07%
  TT's per cycle (trap instructions)      0.28%     0.00%     0.07%
  GS's per cycle                          0.08%     0.00%     0.02%
  total traps per cycle                   1.29%     0.10%     0.33%






Table A.25: Trap frequencies, Part 2.

                                 ST    system      both

average of macro-benchmarks
  cycles                      0.00%     0.00%   100.00%
  TT's per cycle              0.88%     0.00%     0.32%
  WO's per cycle              0.37%     0.05%     0.13%
  WU's per cycle              0.33%     0.08%     0.11%
  TT's per cycle              0.19%     0.00%     0.06%
  GS's per cycle              0.02%     0.00%     0.00%
  total traps per cycle       1.79%     0.14%     0.63%


A.3.8. Evaluating the Performance Impact of Shadow Registers


To ascertain the time cost of omitting shadow registers from SOAR, we measured the
frequencies of the various types of traps, estimated the added cost of handling each type
without shadow registers, and multiplied the two together. One trap we could not measure
was the page fault trap. Handling a page fault takes so long, though, that the few cycles
saved by shadow registers will not make much difference. The traps we did include were:
integer tag traps (TT) on ALU and load/store instructions, register window overflows (WO)
on call instructions, register window underflows (WU) on return instructions, traps caused by
conditional trap instructions (TT), and Generation Scavenge traps (GS) on store instructions.
Of these, only tag and Generation Scavenge trap handlers profit from the shadow registers.
Table A.26 summarizes our results. These data suggest that shadow registers do not
significantly improve performance. The maximum improvement is 0.12%.

Table A.26: Time cost of omitting shadow registers.
(All figures in percents.)

                               ST    system      both

testActivationReturn
  cycles                   95.91%     4.09%      100%
  shadow cost for GS           0%        0%        0%
  shadow cost for TT           0%        0%        0%
  shadow cost for both         0%        0%        0%

testClassOrganizer
  cycles                   42.56%    57.44%      100%
  shadow cost for GS        0.00%        0%     0.00%
  shadow cost for TT        0.12%        0%     0.05%
  shadow cost for both      0.12%        0%     0.05%

testCompiler
  cycles                   34.07%    65.93%      100%
  shadow cost for GS        0.00%     0.01%     0.00%
  shadow cost for TT        0.07%        0%     0.02%
  shadow cost for both      0.07%     0.01%     0.03%

testDecompiler
  cycles                   32.38%    67.62%      100%
  shadow cost for GS           0%        0%        0%
  shadow cost for TT        0.04%        0%     0.01%
  shadow cost for both      0.04%        0%     0.01%

testPrintDefinition
  cycles                   38.09%    61.91%      100%
  shadow cost for GS        0.00%        0%     0.00%
  shadow cost for TT        0.30%        0%     0.12%
  shadow cost for both      0.30%        0%     0.12%

testPrintHierarchy
  cycles                   25.90%    74.10%      100%
  shadow cost for GS        0.02%        0%     0.01%
  shadow cost for TT        0.02%        0%     0.00%
  shadow cost for both      0.04%        0%     0.01%

average of macro-benchmarks
  cycles                   34.60%    65.40%   100.00%
  shadow cost for GS        0.00%     0.00%     0.00%
  shadow cost for TT        0.11%     0.00%     0.04%
  shadow cost for both      0.11%     0.00%     0.04%

A.3.9. Does SOAR Really Need Vectored Traps?

Suppose the reason for a trap appeared in the PSW register. Then the instructions in
Table A.27 would simulate the effect of vectored traps. As the table shows, the cost would
be four more cycles per trap.

We can then estimate the overall performance impact by counting the number of traps
that occur (Table A.28). Since this would presumably allow us to shorten our traps by a
cycle, the table also lists the cost of the extra trap cycle in the current SOAR system. The
table indicates that the net effect of non-vectored traps would be a 2.2% time
penalty.
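
The net penalty can be reproduced from the macro-benchmark averages in Table A.28. The sketch below is illustrative only; the function and parameter names are hypothetical. It assumes that each trap currently pays one extra cycle (so the "cost of extra trap cycle" column equals traps per cycle) and that software dispatch adds four cycles per trap, as stated above.

```python
def nonvectored_trap_penalty(traps_per_cycle: float,
                             dispatch_cycles: int = 4,
                             cycles_saved_per_trap: int = 1):
    """Return (dispatch cost, saving from shorter traps, net penalty),
    each expressed as a fraction of all cycles."""
    cost = traps_per_cycle * dispatch_cycles
    saving = traps_per_cycle * cycles_saved_per_trap
    return cost, saving, cost - saving


# Table A.28, average of macro-benchmarks, "both" column:
# the extra trap cycle costs 0.72% of all cycles.
cost, saving, net = nonvectored_trap_penalty(0.0072)
print(f"dispatch cost {cost:.2%}, saving {saving:.2%}, net {net:.2%}")
```

The dispatch cost comes out at 2.88%, close to the 2.89% listed in the table; subtracting the one-cycle saving yields the net figure of roughly 2.2%.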


A.4. Procedure Calls 


Next we examine SOAR’s features that help procedure calls. 


A.4.1. Evaluating SOAR’s Register File Organization 

Unlike other RISCs, the chips designed at Berkeley feature multiple overlapping
on-chip register windows. These reduce the amount of saving and restoring for calls and
returns. If this feature were left out of SOAR, then each call would have to save the registers
it needed, and each return would have to restore the saved registers. To measure this
hypothetical cost, assuming no compiler optimization, we counted the number of non-nil
registers before each return instruction. This count of modified registers was then doubled to
account for both the saving and restoring cost. Finally, we added two cycles per return to


Table A.27: Simulating vectored traps.
%jump 

%extract psw, 2, r_temp 

%ret f 
(jump table) 

Extra Cost 4 cycles 























Table A.28: Time cost of non-vectored traps, Part 1.

                                          Smalltalk      System        both

testActivationReturn
  instructions                               97.21%       2.79%        100%
  time                                       95.91%       4.09%        100%
  traps per instruction                       0.30%       0.02%       0.29%
  cost of extra trap cycle/all cycles         0.22%       0.01%       0.21%
  cost of non-vectored traps/all cycles       0.89%       0.04%       0.85%

testClassOrganizer
  instructions                               41.06%      58.94%        100%
  time                                       42.56%      57.44%        100%
  traps per instruction                       3.75%       0.25%       1.69%
  cost of extra trap cycle/all cycles         2.54%       0.18%       1.18%
  cost of non-vectored traps/all cycles      10.14%       0.72%  [illegible]

testCompiler
  instructions                               33.42%      66.58%        100%
  time                                       34.07%      65.93%        100%
  traps per instruction                       3.31%       0.35%       1.34%
  cost of extra trap cycle/all cycles         2.22%       0.24%       0.92%
  cost of non-vectored traps/all cycles       8.88%       0.97%       3.66%

testDecompiler
  instructions                               32.19%      67.81%        100%
  time                                       32.38%      67.62%        100%
  traps per instruction                 [illegible]       0.22%       1.08%
  cost of extra trap cycle/all cycles         1.98%       0.15%       0.74%
  cost of non-vectored traps/all cycles       7.90%       0.59%       2.96%

testPrintDefinition
  instructions                               38.01%      61.99%        100%
  time                                       38.09%      61.91%        100%
  traps per instruction                       1.23%       0.05%       0.50%
  cost of extra trap cycle/all cycles         0.90%       0.03%       0.36%
  cost of non-vectored traps/all cycles       3.60%       0.14%       1.46%

testPrintHierarchy
  instructions                               26.25%      73.75%        100%
  time                                       25.90%      74.10%        100%
  traps per instruction                       1.81%       0.15%       0.58%
  cost of extra trap cycle/all cycles         1.29%       0.10%       0.41%
  cost of non-vectored traps/all cycles       5.16%  [illegible]      1.65%




























Table A.28: Time cost of non-vectored traps, Part 2.

                                          Smalltalk      System        both

average of macro-benchmarks
  instructions                               34.19%      65.81%     100.00%
  time                                       34.60%      65.40%     100.00%
  traps per instruction                       2.60%       0.20%       1.04%
  cost of extra trap cycle/all cycles         1.79%       0.14%       0.72%
  cost of non-vectored traps/all cycles       7.14%       0.57%       2.89%


account for the extra cycle of the loadm and storem instructions. Table A.29 presents these 
data. SOAR’s multiple register windows are the most significant architectural feature on the 
chip: The benchmarks would take 70% more time without them. 


How much would the image expand without register windows? The cost would be two 
instructions upon entering a subroutine (a subtract to adjust a stack pointer and a storem to 
save registers), and two instructions for each return from the routine (a loadm to restore the 
registers and an add to restore the sp). Table A.30 gives our analysis. 
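
The static estimate can be checked with a short calculation using the raw counts from Table A.30. This is a sketch under stated assumptions rather than the original analysis script: the names are hypothetical, instructions are assumed to occupy four bytes each, and "1500 kB" is taken to mean 1,500,000 bytes.

```python
BYTES_PER_INSTRUCTION = 4  # ASSUMPTION: 32-bit SOAR instructions


def image_growth_without_windows(entry_points: int,
                                 exit_points: int,
                                 image_bytes: int,
                                 insts_per_entry: int = 2,
                                 insts_per_exit: int = 2) -> float:
    """Relative growth of the compiled image if every routine entry and
    exit needed explicit save/restore instructions."""
    extra_insts = entry_points * insts_per_entry + exit_points * insts_per_exit
    return extra_insts * BYTES_PER_INSTRUCTION / image_bytes


# Raw counts from Table A.30.
growth = image_growth_without_windows(4654, 6795, 1_500_000)
print(f"image growth without register windows: {growth:.2%}")
```

Under these assumptions the result agrees with the 6.11% relative cost listed in the table.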


A.4.2. Number of Registers per Window

With only eight registers, SOAR's windows are much smaller than RISC II's.
Measurements of Berkeley Smalltalk suggested that this would be sufficient. To verify this we
instrumented our system and ran some benchmarks. When more registers are needed for a
subroutine, it allocates a spill area in main memory. Thus, we merely counted the number of
spill objects allocated and divided by the total number of calls. Also, we measured how
many words were spilled to determine how many more registers were needed. Table A.31
presents these data. These data show that SOAR's windows are large enough for
Smalltalk-80 programs; more than 97% of the subroutines called fit into a window.
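
The derived rows of Table A.31 follow directly from the raw counts. The sketch below recomputes them for testCompiler; the function name is hypothetical, and because the table gives the cycle and call totals only as lower bounds (">1,100,000", ">18,000"), the results are approximate.

```python
def spill_statistics(cycles: int, calls: int,
                     spill_calls: int, spill_words: int) -> dict:
    """Derive the summary rows of the spill-area analysis from raw counts."""
    return {
        "fraction_of_calls_spilling": spill_calls / calls,
        "avg_words_per_spill": spill_words / spill_calls,
        "cycles_per_spill_allocation": cycles / spill_calls,
    }


# testCompiler raw counts from Table A.31 (totals are lower bounds).
stats = spill_statistics(cycles=1_100_000, calls=18_000,
                         spill_calls=430, spill_words=883)
for name, value in stats.items():
    print(f"{name}: {value:.4g}")
```

Roughly one call in forty needs a spill area, and a spill holds about two words, which is consistent with the claim that more than 97% of calls fit in an eight-register window.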


A.4.3. Analysis of Loadm & Storem


The first step in evaluating the impact of the load- and store- multiple instructions is to 
measure their frequency. Since the time to simulate one of these instructions depends on the 























































Table A.29: Analysis of r


PTTT 



ST 

«y* 

testDecompiler

instructions 

32.19% 

67.81% 

i i 

32.38% 

67.62% 





instructions 


38.01% 


61.99% 
% 


cost of saving & restoring regs/all cycles 

cost of WOAJ 

net cost of no res file 


rf vs full SOAR 


36.17% 30.69% 


les 


retw’s* / all insts 
retw’s* / cycles 
avg regs used / retw* 



vs full SOAR 


8 . 68 % 

6 . 20 % 

4.01 





mzznzsnszzm 


retw’s* / all insts 
retw’s* / cycles 
avg regs used / retw* 

Table A.30: Static analysis of register windows.

routine entry points     4654
routine exit points      6795
image size               1500 kB
relative cost            6.11%


Table A.31: Spill area analysis.

testCompiler
  total number of cycles                        >1,100,000
  total number of Smalltalk calls               >18,000
  number of calls using spill area              430
  total size of spill areas actually needed     883
  avg. words of spill area used                 2.1
  fraction of calls needing spill areas         2.3%
  mean number of cycles per spill allocation    2,600

testDecompiler
  total number of cycles                        >2,900,000
  total number of Smalltalk calls               >46,000
  number of calls using spill area              1085
  total size of spill areas actually needed     2807
  avg. words of spill area used                 2.6
  fraction of calls needing spill areas         2.4%
  mean number of cycles per spill allocation    2,700



number of registers actually accessed, we also gathered those data (Table A.32). The loadm
and storem instructions rarely occur: only one in 130 instructions.

Table A.33 shows the performance consequences of eliminating this seldom-used
feature. As expected from the frequency data, these instructions have minimal impact:
SOAR would be only 3% slower without them.


How much larger would the compiled image grow if we eliminated loadm and storem?
Originally, these instructions were intended only for the system code. In that case there
would be no significant static impact. However, our current strategy for spill areas requires
a routine that allocates a spill area to initialize it. We therefore instrumented our compiler to
count the number of words initialized this way (Table A.34). (We also subtracted out the
number of ret instructions used solely to write nil into several registers prior to the storem.)
Omitting these instructions would increase the size of the system by only 2%.
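
The static figure can be recomputed from the raw data in Table A.34. This is a minimal sketch with hypothetical names; it assumes four-byte words and reads "1500 kB" as 1,500,000 bytes, which is the combination that reproduces the tabulated value.

```python
def relative_static_cost(extra_words: int, image_bytes: int,
                         bytes_per_word: int = 4) -> float:
    """Extra image space, as a fraction of the whole image, for the
    given number of additional words."""
    return extra_words * bytes_per_word / image_bytes


# Table A.34: 7363 words of storem-related initialization in a 1500 kB image.
storem_cost = relative_static_cost(7363, 1_500_000)
print(f"relative static cost of omitting storem: {storem_cost:.2%}")
```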

Table A.32: Loadm/storem execution frequencies, Part 1.


ST

SYS

both

testActivationReturn

instructions 

97.21% 

2.79% 

100% 

loadms per instruction 

0.00% 

5.19% 

0.14% 

loadms w/ 8 regs 

0.00% 

100.00% 

100.00% 

mean loedm regs 

0 

8 

8 

storems per instruction 

0.00% 

5.19% 

0.14% 

storems w / 8 regs 

0.00% 

100.00% 

100.00% 

mean storem regs 

0 

8 

8 

testClassOrganizer

instructions 

41.06% 

58.94% 

100% 

loadms per instruction 

0.00% 

0.62% 

036% i 

loadms w/ 8 regs 

0.00% 

100.00% 

100.00% > 

mean loadmregs 

0 

8 

8 ! 

storems per instruction 

0.74% 

0.65% 

0.69% 

storems w/ 5 regs 

0.00% 

0.13% 

0.07% 

storems w/6 regs 

0.00% 

0.00% 

0.00% 

storems w/ 7 regs 

100.00% 

5.06% 

46.89% ! 

storems w/8 regs 

0.00% 

94.81% 

53.04% i 

mean storem regs 

7 

7.95 

733 i 

testCompiler

instructions 

. 33.42% 

66.58% 

100% 

loadms per instruction 

0.00% 

0.67% 

0.45% 

loadms w/ 7 regs 

0.00% 

17.70% 

17.70% 

loadms w/ 8 regs 

0.00% 

82.30% 

8230% 

mean loadm regs 

0 

7.82 

7.82 

storems per instruction 

0.75% 

0.65% 

0.69% 

storems w/ 4 regs 

0.05% 

0.00% 

0.02% 

storems w/ S regs 

0.85% 

0.12% 

039% i 

storems w/ 6 regs 

2.72% 

0.00% 

1.00% 1 

storems w/ 7 regs 

96.38% 

1534% 

45.21% 1 

storems w/ 8 regs 

0.00% 

8433% 

5338% | 

mean storem regs 

6.95 

7.84 

732 

Table A.32: Loadm/storem execution frequencies, Part 2.

ST 

SYS 

both 

testDecompiler 



instructions 

32.19% 

67.81% 

100% 

loadms per instruction 

0.00% 

0.33% 

024% 

loadms w/ 8 regs 

0.00% 

100.00% 

100.00% 

mean loadm regs 

0 

8 

8 

storems per instruction 

0.73% 

0.31% 

038% 

storems w/ 4 regs 

0.62% 

0.00% 

023% 

storems w/ 5 regs 

0.00% 

0.00% 

0.00% 

storems w/ 6 regs 

0.62% 

0.00% 

025% 

storems w/ 7 regs 

98.76% 

31.02% 

5835% 

storems w/ 8 regs 

0.00% 

68.98% 

41.15% 

mean storem legs 

6.98 

7.69 

7.40 

testPrintDefinition

instructions 

38.01% 

61.99% 

100% 

loadms per instruction 

0.00% 

0.06% 

0.04% 

loadms w/ 8 regs 

0.00% 

100.00% 

100.00% 

mean loadm regs 

0 

8.00 

8.00 

storems per instruction 

0.00% 

0.14% 

0.09% 

storems w/ 5 regs 

0.00% 

2.13% 

2.13% 

storems w/ 6 regs 

0.00% 

0.00% 

0.00% 

storems w/ 7 regs 

0.00% 

55.32% 

5532% 

storems w/ 8 tegs 

0.00% 

4235% 

4235% 

mean storem regs 

0 

7.38 

738 

testPrintHierarchy

instructions 

26.23% 

73.75% 

100% 

loadms per instruction 

0.00% 

0.27% 

020% 

loadms w/ 7 regs 

0.00% 

14.37% 

1437% 

loadms w/8 regs 

0.00% 

85-63% 

85.63% 

mean loadm regs 

0 

7.86 

7.86 

storems per instruction 

0.24% 

0.43% 

038% 

storems w/ 5 regs 

0.00% 

433% 

3.79% 

storems w/ 6 regs 

0.00% 

0.00% 

0.00% 

storems w/ 7 regs 

100.00% 

4131% 

51.10% 

storems w/ 8 regs 

0.00% 

53.96% 

45.11% 

mean storem regs 

7 

7.45 

7.38 

Table A.32: Loadm/storem execution frequencies, Part 3.



ST 

SYS 

both 

avg of macros 

instructions 

34.19% 

65.81% 

100% 

loedms per instruction 

0% 


026% 

loedms w/ 7 regs 

0% 

6.41% 

6.41% 

loedms w/ 8 regs 

0% 

9339% 

9339% 

mean loadm regs 

0 

7.94 

7.94 

store ms per instruction 


0.48% 


store ms w/ 4 regs 

0.13% 

0% 

0.05% 

storems w/ 5 regs 

0.17% 

1.38% 

1.28% 

storems w/ 6 regs 

0.67% 

0% 

0.25% 

storems w/ 7 regs 

79.03% 

29.69% 

5137% 

storems w/ 8 regs 

0% 

68.93% 

47.05% 

mean storem regs 

539 

7.66 

7.44 






















Table A.33: Time cost of omitting loadm & storem.
(All costs in percents.)

benchmark                        ST       SYS      both

testActivationReturn
  cycles                     95.91%     4.09%      100%
  loadm cost/all cycles          0%    18.23%     0.75%
  storem cost/all cycles         0%    18.23%     0.75%
  total cost                     0%    36.47%     1.49%

testClassOrganizer
  cycles                     42.56%    57.44%      100%
  loadm cost                     0%     3.11%     1.79%
  storem cost                 2.99%     3.26%     3.14%
  total cost                  2.99%     6.37%     4.93%

testCompiler
  cycles                     34.07%    65.93%      100%
  loadm cost                     0%     3.15%     2.08%
  storem cost                 3.01%     3.08%     3.06%
  total cost                  3.01%     6.24%     5.14%

testDecompiler
  cycles                     32.38%    67.62%      100%
  loadm cost                     0%     1.71%     1.15%
  storem cost                 2.98%     2.37%     2.51%
  total cost                  2.98%     4.07%     3.72%

testPrintDefinition
  cycles                     38.09%    61.91%      100%
  loadm cost                     0%     0.30%     0.19%
  storem cost                    0%     0.65%     0.40%
  total cost                     0%     0.96%     0.59%

testPrintHierarchy
  cycles                     25.90%    74.10%      100%
  loadm cost                     0%     1.31%     0.97%
  storem cost                 1.02%     1.96%     1.72%
  total cost                  1.02%     3.28%     2.69%

macro avg.
  cycles                     34.60%    65.40%      100%
  loadm cost                     0%     1.92%     1.24%
  storem cost                    2%     2.26%     2.18%
  total cost                     2%     4.18%     3.41%


Table A.34: Raw data for static analysis of store multiple.

description               count
cost for storem           7363 words
total SOAR image size     1500 kB
relative static cost      1.96%





























A.4.4. Performance of Inline Caching 


First, we measured the cost of SOAR's in-line cache. In other words, if no procedure
lookups were needed, how much faster could SOAR run? To evaluate SOAR's in-line
cache, we counted the occurrences of the cache probe conditional trap instruction. That gave
us the number of probes. Then, since the prologue takes five cycles, we can easily get the
probe time. For the misses, we added two components: the miss trap handler time, obtained
by multiplying the number of misses (trap instruction traps) by the trap handler path length,
and the lookup time, obtained directly from an execution profile. Table A.35 summarizes
these data, which show that in-line caching takes a lot of time; 23% of SOAR's time is spent
testing the cache and handling misses. Without any caching at all, the probe time would
decrease to zero, but the miss time would increase by a factor of 1/3.53% = 28. In other words,
what takes 100 seconds with in-line caching would take 100 - 10.88 + 12.46 x 28 = 438 seconds.
SOAR would be four times slower with no cache at all.
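
The arithmetic in this paragraph can be written out as a small function. This is an illustrative sketch with hypothetical names; it follows the text in rounding the miss multiplier 1/3.53% to the whole number 28 before applying it.

```python
def time_without_cache(total_time: float, probe_time: float,
                       miss_time: float, miss_rate: float) -> float:
    """Estimated run time with no method cache at all: probes vanish,
    but every send pays the full miss cost."""
    miss_factor = round(1.0 / miss_rate)  # the text rounds 1/3.53% to 28
    return total_time - probe_time + miss_time * miss_factor


# Macro-benchmark averages from Table A.35 ("both" column), in percent
# of current run time: probes 10.88, misses 12.46, miss rate 3.53%.
no_cache = time_without_cache(100.0, 10.88, 12.46, 0.0353)
print(f"time without a cache: {no_cache:.0f} (current system = 100)")
print(f"slowdown factor: {no_cache / 100.0:.2f}")
```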

Next, we compared the 23% cost for the in-line cache with other caching schemes.
One of these was the hash table cache found in interpretive Smalltalk-80 systems. The other
scheme was an in-line indirect cache. Each call would jump through a per-process area with
each process's cache entries. Table A.36 shows the code sequences needed for these two
types of cache. The hash table cache is the most expensive scheme, requiring 23 cycles for a
cache probe. SOAR's in-line cache requires a prologue of only 5 cycles. The indirect
scheme adds a cycle for the indirect call and one for an indirect load in the prologue for a
total of 7.

Assuming that the cache miss cost is independent of the caching scheme, we can use
the cache probe frequency data to calculate the costs of these caching schemes (Table A.37).
The bottom line in the table gives the average speed of the various schemes. SOAR would
run only 75% as fast as it does now with a conventional hash table cache. In other words,
the work that requires 100 cycles would take 133 with a conventional cache.
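
The comparison can be approximated from the aggregate probe frequency. The sketch below is not the procedure used for Table A.37, which averages per-benchmark results; it applies the same idea to the macro-benchmark average (1.86% probes per cycle from Table A.35), so it only roughly matches the tabulated figure. The function name is hypothetical, and the miss cost is assumed unchanged between schemes, as stated above.

```python
def relative_speed(probes_per_cycle: float, probe_cycles: int,
                   baseline_probe_cycles: int = 5) -> float:
    """Speed of an alternative caching scheme relative to SOAR's in-line
    cache (1.0 = same speed), changing only the cost of each probe."""
    baseline_probe_time = probes_per_cycle * baseline_probe_cycles
    new_time = 1.0 - baseline_probe_time + probes_per_cycle * probe_cycles
    return 1.0 / new_time


# Hash-table cache: 23 cycles per probe instead of the 5-cycle prologue.
hash_speed = relative_speed(0.0186, 23)
print(f"hash-table cache speed relative to SOAR: {hash_speed:.1%}")
print(f"cycles needed for 100 cycles of current work: {100 / hash_speed:.0f}")
```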






















Table A.35: Inline cache performance evaluation, Part 1.


description                    ST    system      both


testActivationReturn


instructions 

cycles 


probes per inst 
probes per cycle 
loadc traps per probe 


probe insts per inst 
loadc trapH insts per inst 
be & trapH insts per inst 


probe cycles per cycle 
loadc trapH cycles per cycle 
miss trapH cycles per cycle 


probe & trapH cycles per cycle 
total miss time 


total cache time 


97.21% 

95.91* 


9.47* 

7.06* 

0* 

0 * 


28.40% 

0* 

28.40* 


35.32* 

0* 


2.79* 

4.09* 


1* 
0 . 01 * 
0* 
0* 


0.03* 

0 - 0 % 

0.03-0.03* 


0.03* 

0 - 0 * 


35.32* 0.03-0.03% 


100 % 

100 % 


9.20* 

6.77* 


27.61% 

0 - 0 * 

27.61-27.61* 


33.87* 

04 )* 

0 * 


33.87-33.87* 

0* 


33.87-33.87* 


probes per inst 
probes per cycle 
loadc traps per probe 
misses per probe 


testClassOrganizer 


;j 7.24* 
jj 4.90* 
H 25.39* 
0.96* 


58.94* 

57.44* 


0.05* 

0.04* 

CH)% 

0* 



probe cycles per cycle 
loadc trapH cycles per cycle 
miss trapH cycles per cycle 

24.48* 0.18* 

8.70* 0-0* 

0.14* 0* 

probe Sl trapH cycles per cycle 
total miss time 

33.18* 0.18-0.18* 

total cache time 


100 * 

100 * 


3.00* 

2 . 10 * 

25.15-25.15% 

0.95* 


9.01* 

2.27-2.27* 

11.27-11.27* 


10.52% 

3.70-3.70* 

0.06* 


14.22-14.22* 

2 . 66 * 


16 . 88 - 16 . 88 * 






































































































































Table A.35: Inline cache performance evaluation, Part 3.

description                    ST    system      both

testPrintDefinition 

instructions i! 38.01% 61.99% 

cycles ii 38.09% 61.91% 

100% 

100% 


probes per inst 
probes per cycle 
loedc traps per probe 


7.98% 

5.83% 

1.03% 


0.73% 


3.06% 

2.24% 

1 . 02 - 1 . 02 % 

0.72% 


9.18% 

0.09-0.09% 

9.27-9.27% 


1121% 

0.16-0.16% 

0.05% 


probe & trapH cycles per cycle 

|i 29.59% 

0.15-0.15% 

11.37-11.37% 

total miss time 

i! 

l> 


1.95% 

total cache time 



13.31-13.31% 

testPrintHierarchy
  instructions                        26.25%       73.75%           100%
  cycles                              25.90%       74.10%           100%
  probes per inst                      7.62%        0.16%          2.12%
  probes per cycle                     5.44%        0.11%          1.49%
  loadc traps per probe                4.47%         0-0%     4.22-4.22%
  misses per probe                     5.13%           0%          4.84%
  probe insts per inst                22.86%        0.48%          6.36%
  loadc trapH insts per inst           1.02%         0-0%     0.27-0.27%
  probe & trapH insts per inst        23.88%   0.48-0.48%     6.62-6.62%
  probe cycles per cycle              27.20%        0.56%          7.46%
  loadc trapH cycles per cycle         1.70%         0-0%     0.44-0.44%
  miss trapH cycles per cycle          0.84%           0%          0.22%
  probe & trapH cycles per cycle      28.90%   0.56-0.56%     7.90-7.90%
  total miss time                                                 18.52%
  total cache time                                          26.42-26.42%



































































Table A.35: Inline cache performance evaluation,


description 


average of macro-benchmarks 


instructions j 34.19% 65.81% 

cycles |; 34.60% 65.40% 


probes per inst
probes per cycle
loadc traps per probe
misses per probe


probe insts per inst
loadc trapH insts per inst
probe & trapH insts per inst


probe cycles per cycle 
loadc trapH cycles per cycle 


probe & trapH cycles per cycle 
total miss time 


total cache time 


7.47% 
5.1 
2.6 
3.7 


22.40% 

177% 

25.17% 


25.96% 

4.39% 


30.35% 


0.13% 

0.09% 

0.00-0.43% 

0 . 00 % 


0.40% 

0 . 00 % 

0.40% 


0.46% 

0 . 00 % 


0.46-0.47% 


Part 4. 


both 


100 . 00 % 

100 % 


2.64% 

1 . 86 % 

12.21-12.23% 

3.53% 


7.93% 

0.99% 

8.91-8.92% 


9.28% 

1.60% 


10 . 88 % 

12.46% 


23.34% 























































Table A.36: Code sequences for various caches.


Hash-table Cache

  %loadc   (r14)classOffset, r6
  %load    (r15)0, r5            ; sel
  %xor     r5, r6, r4
  %load    pcRel(mask), r3
  %and     r3, r4, r4
  %sla     r4, r4
  %sla     r4, r4
  %load    pcRel(base), r3
  %add     r3, r4, r4
  %load    (r4)cacheClass, r3
  %trap3   ne r3, r4
  %load    (r4)cacheSel, r3
  %trap3   ne r3, r4
  %load    (r4)cacheTarget, r3
  %ret     r3, 0

  Time cost: 23 cycles


Indirect Inline Cache

  <indirect call>
  %loadc   (r14)classOffset, r6
  %load    (r15)0, r5
  %load    (r5)rCacheBase, r5    ; uses global OR mapping
  %trap3   ne r5, r6

  Time cost: 7 cycles + 1 cycle for indirect call


SOAR Inline Cache
(rl4)dassOffset, it% 





































Table A.37: Relative performance of various caching schemes.
(SOAR = 100%; faster is better.)

                            no      hash   indirect      SOAR    zero time
                         cache     table     inline     cache   resolution

testActivationReturn   151.23%    45.06%     83.04%      100%      151.23%
testClassOrganizer      28.13%    72.53%     91.86%      100%      120.23%
testCompiler            25.05%    76.10%     93.03%      100%      134.11%
testDecompiler          23.35%    76.57%     93.28%      100%      151.74%
testPrintDefinition     28.58%    71.26%     91.67%      100%      115.30%
testPrintHierarchy      22.14%    78.82%     94.19%      100%      135.51%
average                 25.45%    75.06%     92.81%      100%      131.38%


Next we examine the space impact of these caching strategies. Table A.38 presents the
raw data we have collected from the compiler. The total space taken by SOAR's in-line
caching scheme is the sum of the number of extra words needed to hold the last class for the
sends (measured by the number of cache slots), and the space consumed by the method
prologues. The number of prologues is the same as the number of cache probes. Table A.39
illustrates this prologue. Table A.40 below shows the amounts of overhead at the call site
and at the method prologue for the various caching schemes. Finally, we can combine these
data to show the impact that each scheme would have (Table A.41). Thus, the hash table
cache would save 1.24% of the image space.
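
The figures in Table A.41 can be reproduced by combining the counts in Table A.38 with the per-site overheads in Table A.40. The sketch below is illustrative and uses hypothetical names. One assumption is labeled in the code: the tabulated percentages come out exactly when the overhead count is divided by 1,500,000 directly, so that is what the sketch does; the source does not say whether the overheads are counted in words or bytes.

```python
CALL_SITES = 22025      # Table A.38
PROLOGUES = 4654        # Table A.38 ("cache probes")
IMAGE_SIZE = 1_500_000  # ASSUMPTION: "1500 kB" used as the raw divisor

# (call-site overhead, prologue overhead) from Table A.40.
OVERHEAD = {
    "no lookups": (0, 0),
    "in-line cache": (1, 4),
    "indirect in-line cache": (3, 4),
    "hash table": (1, 0),
}


def space_used(scheme: str) -> int:
    """Total caching overhead for a scheme, in the table's own units."""
    per_call, per_prologue = OVERHEAD[scheme]
    return per_call * CALL_SITES + per_prologue * PROLOGUES


def net_impact(scheme: str, baseline: str = "in-line cache") -> float:
    """Image growth (positive) or savings (negative) relative to SOAR's
    in-line cache, as a fraction of the image size."""
    return (space_used(scheme) - space_used(baseline)) / IMAGE_SIZE


for name in OVERHEAD:
    print(f"{name:24s} {net_impact(name):+.2%}")
```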


Table A.38: Raw data for static analysis of caching.

call sites       22025
cache probes     4654
image size       1500 kB


Table A.39: Inline cache prologue.

<selector>                   needed to handle misses
%loadc   (r14)0, r0          get receiver's class
%load    (r15)0, r1          get last class for send
%trap1   ne r0, r1           verify cache












Table A.40: Space overhead for the various caching schemes.

                          call site overhead   prologue overhead
no lookups                         0                   0
in-line cache                      1                   4
indirect in-line cache             3                   4
hash table                         1                   0


Table A.41: Net space impact of caching schemes.

no lookups                2.71% savings
in-line cache             0
indirect in-line cache    2.94% cost
hash table                1.24% savings

A.4.5. How Fast Does SOAR Shuffle? 

SOAR is a nimble processor; jumps and branches take only one cycle. To understand
the significance of this feature, we can examine the frequency of jumps and calls (Table
A.42). As the table shows, jumps and calls are popular instructions; one instruction in 10 is
a jump and one in 17 is a call. Given the frequency data, we can add the extra cycle SOAR
would require without a fast shuffle (Table A.43). These data show that SOAR would be
11% slower without the fast shuffle mechanism.
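
As an illustration of this calculation, the sketch below charges one extra cycle to every jump and call of the testCompiler benchmark. The counts are read from the "both" column of Table B.6; the function names are hypothetical, and the one-cycle penalty is the assumption stated in the text.

```python
# Counts for testCompiler, "both" column, as read from Table B.6.
STEPS = 743_753            # instructions executed
CYCLES = 1_088_758         # cycles consumed
JUMPS = 50_282 + 29_469    # %jump + jump
CALLS = 6_801 + 36_705     # %call + call


def frequency_percent(count, steps=STEPS):
    """How often an instruction class occurs, per instruction executed."""
    return 100.0 * count / steps


def extra_cycle_cost_percent(count, cycles=CYCLES, extra_cycles=1):
    """Added time, relative to the measured cycles, if each occurrence
    cost `extra_cycles` more than it does on SOAR."""
    return 100.0 * count * extra_cycles / cycles


def slowdown_without_fast_shuffle():
    """Total added time when both jumps and calls pay the extra cycle."""
    return extra_cycle_cost_percent(JUMPS) + extra_cycle_cost_percent(CALLS)


if __name__ == "__main__":
    print(f"jumps per instruction: {frequency_percent(JUMPS):.2f}%")
    print(f"calls per instruction: {frequency_percent(CALLS):.2f}%")
    print(f"cost without fast shuffle: {slowdown_without_fast_shuffle():.2f}%")
```

With these inputs the sketch gives roughly 10.72% jumps and 5.85% calls per instruction, and a total cost of about 11.32% of cycles, in line with the testCompiler rows of Tables A.42 and A.43.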

A.4.6. Evaluation of Parallel Register Initialization

If the return instruction could not write nil into six registers at once, each routine would
have to write nil into its temporary variable registers sequentially. Using [Bla83a] page 139,
Benchmark column, one can compute an average of 1.19 arguments and temporaries per
call, excluding the receiver. Since the average number of arguments per call is 0.88
[McC83] (p. 185, Fig. 10.3), we assume that the average number of temporaries per call is
between zero and one. This gives the number of extra cycles required per call. To measure
the number of calls requiring nilling, we used the number of return instructions that changed
the window. This way, we also included returns from interrupts. Table A.44 presents our
measurement of the extra time that serial instead of parallel nilling would take, assuming no








Table A.42: Frequency of jump and call instructions.

                        ST        system    both

testActivationReturn
  instructions          97.21%    2.79%     100%
  jumps                 5.03%     10.50%    5.18%
  calls                 9.62%     0.08%     9.35%
  jumps & calls         14.65%    10.58%    14.53%

testClassOrganizer
  instructions          41.06%    58.94%    100%
  jumps                 15.10%    8.96%     11.48%
  calls                 14.51%    1.14%     6.63%
  jumps & calls         29.62%    10.10%    18.11%

testCompiler
  instructions          33.42%    66.58%    100%
  jumps                 14.25%    8.95%     10.72%
  calls                 13.74%    1.89%     5.85%
  jumps & calls         27.99%    10.84%    16.57%

testDecompiler
  instructions          32.19%    67.81%    100%
  jumps                 12.91%    8.66%     10.03%
  calls                 13.23%    1.88%     5.54%
  jumps & calls         26.14%    10.55%    15.57%

testPrintDefinition
  instructions          38.01%    61.99%    100%
  jumps                 12.84%    5.51%
  calls                 13.50%    1.89%
  jumps & calls         26.34%    7.40%

testPrintHierarchy
  instructions          26.25%    73.75%    100%
  jumps                 12.41%    7.85%     9.04%
  calls                 13.73%    1.23%     4.51%
  jumps & calls         26.14%    9.07%     13.55%

average of macros
  instructions          34.19%    65.81%    100%
  jumps                 13.50%    7.99%     9.91%
  calls                 13.74%    1.61%     5.77%


Table A.43: Cost of omitting fast shuffle.

                        ST        system    both

testActivationReturn
  cycles                95.91%    4.09%     100%
  jump cost             3.73%     5.27%     3.82%
  call cost             7.17%     0.04%     6.88%
  total cost            10.93%    5.31%     10.70%

testClassOrganizer
  cycles                42.56%    57.44%    100%
  jump cost             10.21%    6.44%     8.05%
  call cost             9.81%     0.82%     4.65%
  total cost            20.02%    7.26%     12.69%

testCompiler
  cycles                34.07%    65.93%    100%
  jump cost             9.55%     6.18%     7.32%
  call cost             9.21%     1.30%     4.00%
  total cost            18.76%    7.48%     11.32%

testDecompiler
  cycles                32.38%    67.62%    100%
  jump cost             8.80%     5.96%     6.88%
  call cost             9.02%     1.30%     3.80%
  total cost            17.82%    7.25%     10.67%

testPrintDefinition
  cycles                38.09%    61.91%    100%
  jump cost             9.38%     4.04%     6.07%
  call cost             9.87%     1.38%     4.61%
  total cost            19.25%    5.42%     10.69%

testPrintHierarchy
  cycles                25.90%    74.10%    100%
  jump cost             8.86%     5.50%     6.37%
  call cost             9.80%     0.86%     3.18%
  total cost            18.66%    6.36%     9.55%

average of macro benchmarks
  cycles                34.60%    65.40%    100%
  jump cost             9.36%
  call cost             9.54%
  total cost            18.90%



changes in compiler strategy. The data show that SOAR would run 4% slower without
parallel nilling.

To analyze the impact of parallel nilling on the size of the compiled image, we
instrumented our compiler (Table A.45). To do this, we kept a running total of the number of
temporary variables that would be kept in registers. Assuming that each variable would











































































































Table A.44: Evaluation of parallel nilling, Part 2.

                                             ST             system    both

testPrintHierarchy
  avg. regs containing pointers per ret*w    n.a.
  avg. temp vars                             0-1
  ret*w's per instruction                    8.68%
  ret*w's per cycle                          6.20%
  cost of nilling                            0%-6.20%       4.42%     3.28%-4.89%

average of macro-benchmarks
  instructions                               34.19%         65.81%
  cycles                                     34.60%         65.40%    100.00%
  avg. regs containing pointers per ret*w    n.a.           1.80      n.a.
  avg. temp vars                             0-1            n.a.      n.a.
  ret*w's per instruction                    9.01%          4.07%     5.73%
  ret*w's per cycle                          6.24%          2.89%     4.02%
  cost of nilling                            0.00%-6.25%    5.05%     3.27%-5.44%


require an additional instruction to nil it, we can then compute the space overhead nilling
would require without hardware support. The table shows that our image would be 1.29%
larger if SOAR lacked this feature.
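
The static figure can be reproduced with a short calculation. The sketch below is an illustration only; it assumes that the two counts reported in Table A.45 are numbers of extra 4-byte instructions and that the 1500 kB image is 1,500,000 bytes, which is the reading that matches the published percentages. The names are hypothetical.

```python
# Counts from Table A.45.
NIL_TEMP_INSTRUCTIONS = 2348     # nilling cost for temporary variables
NIL_SPILL_INSTRUCTIONS = 2472    # nilling cost for spill initialization

IMAGE_BYTES = 1_500_000          # assumption: "1500 kB" read as 1,500,000 bytes
BYTES_PER_INSTRUCTION = 4        # assumption: one 32-bit word per instruction


def static_cost_percent(extra_instructions):
    """Image growth, in percent, caused by the extra nilling instructions."""
    return 100.0 * extra_instructions * BYTES_PER_INSTRUCTION / IMAGE_BYTES


if __name__ == "__main__":
    temps = static_cost_percent(NIL_TEMP_INSTRUCTIONS)
    spills = static_cost_percent(NIL_SPILL_INSTRUCTIONS)
    print(f"temps {temps:.2f}%  spills {spills:.2f}%  total {temps + spills:.2f}%")
```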

A.4.7. Return Options

The inclusion of three optional operations in SOAR's return instruction adds some
complexity to the architecture. Which of the possible combinations are really used? Table A.46
shows our dynamic frequency data. As expected, the normal return, retnw, was used nearly
three quarters of the time. Although seven out of the eight possible versions were actually
used, only ret, reti, retw, and retnw are essential; the rest could be omitted. The other 10%


Table A.45: Static analysis of parallel nilling.

nilling cost for temporary variables        2348
nilling cost for spill initialization       2472

total SOAR image size                       1500 kB

relative static cost to nil temps           0.63%
relative static cost to nil spill obj.      0.66%

total static cost for serial nilling        1.29%


Table A.46: Dynamic frequency of return options, Part 1.

testActivationReturn
  returns per instruction     9.78%
  returns per cycle           7.20%
  %reti's per return          1.48%
  %retn's per return          1.48%
  %retnw's per return         0.01%
  retnw's per return          95.54%
  %retiw's per return         1.48%

testClassOrganizer
  returns per instruction     8.03%
  returns per cycle           6.46%
  %ret's per return           4.72%
  %reti's per return          12.59%
  %retn's per return          5.90%
  retn's per return           0.03%
  %retw's per return          2.26%
  retw's per return           0.48%
  %retnw's per return         11.92%
  retnw's per return          58.20%
  %retiw's per return         3.90%

testCompiler
  returns per instruction     8.18%
  returns per cycle           5.59%
  %ret's per return           3.91%
  %reti's per return          11.78%
  %retn's per return          9.24%
  retn's per return           0.13%
  %retw's per return          1.58%
  retw's per return           0.53%
  %retnw's per return         16.07%
  retnw's per return          52.16%
  %retiw's per return         4.48%
  %retinw's per return        0.12%






























Table A.46: Dynamic frequency of return options, Part 2.

testDecompiler
  returns per instruction     7.38%
  returns per cycle           5.06%
  %ret's per return           4.73%
  %reti's per return          11.37%
  %retn's per return          8.77%
  retn's per return           0.36%
  %retw's per return          0.55%
  retw's per return           0.02%
  %retnw's per return         13.33%
  retnw's per return          57.61%
  %retiw's per return         3.26%

testPrintDefinition
  returns per instruction     7.84%
  returns per cycle           5.74%
  %ret's per return           8.45%
  %reti's per return          5.87%
  %retn's per return          1.90%
  %retw's per return          4.74%
  %retnw's per return         11.48%
  retnw's per return          67.08%
  %retiw's per return         0.47%

testPrintHierarchy
  returns per instruction     5.68%
  returns per cycle           4.00%
  %ret's per return           5.29%
  %reti's per return          7.18%
  %retn's per return          7.76%
  retn's per return           0.17%
  %retw's per return          1.02%
  %retnw's per return         12.84%
  retnw's per return          62.64%
  %retiw's per return         3.04%
  %retinw's per return        0.06%



































of the returns would just require an extra cycle or two to synthesize. Since a return only
occurs about one in twenty cycles, the effect would be to add a cycle or two every 200
cycles. This would degrade performance less than 1%.
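
The estimate above is a simple product, sketched below with the rounded figures from the text (one return per twenty cycles, 10% of returns using a non-essential variant, one or two extra cycles each). The function name is hypothetical.

```python
def synthesis_penalty_percent(returns_per_cycle, nonessential_fraction, extra_cycles):
    """Added time, in percent of cycles, if the non-essential return
    variants had to be synthesized from the essential ones."""
    return 100.0 * returns_per_cycle * nonessential_fraction * extra_cycles


if __name__ == "__main__":
    low = synthesis_penalty_percent(1 / 20, 0.10, 1)
    high = synthesis_penalty_percent(1 / 20, 0.10, 2)
    print(f"penalty between {low:.1f}% and {high:.1f}% of cycles")
```

With these rounded inputs the penalty falls between about 0.5% and 1% of cycles; the measured return rates in Table A.46 are mostly a little above one in twenty cycles, so the real figure would be of the same order.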

A.5. Storage Management

This section contains an evaluation of SOAR's features to help manage storage.

A.5.1. Evaluation of the Generation Scavenge Tag Checking Hardware

The first step in understanding the performance impact of eliminating tagged store
instructions from SOAR is an execution frequency measurement (Table A.47). At 0.36%,


tagged store frequency
                                       ST        system    both

testPopStoreInstVar    instructions    81.28%    18.72%
testClassOrganizer     instructions    41.06%    58.94%
testCompiler           instructions    33.42%    66.58%
testDecompiler         instructions    32.19%    67.81%
testPrintDefinition    instructions    38.01%    61.99%
testPrintHierarchy     instructions    26.25%    73.75%

The second step is to examine the cost of doing the check in software (Table A.48):
simulating this feature takes four cycles. The number of tagged stores executed per cycle can then
be multiplied by the simulation cost (Table A.49). The result of this calculation is that the
worst-case macro-benchmark would run only 3% slower without this feature.

Next we examine the space cost of eliminating the generation tag checking hardware.
Table A.50 gives the static frequency of these store instructions. As expected from the rarity
of execution, tagged stores account for very little of the code, or about 2%.

Finally, we multiply the 3 word space penalty by the static frequency (Table A.51) to
compute that the Smalltalk-80 image would grow by only 3% if tagged stores were removed
from SOAR.


Table A.48: Workaround for tagged stores.

%store   (a)i, b
%and     a, 0xf << 28, ta
%and     b, 0xf << 28, tb
%trap    lt ta, tb            ; trap if a younger
%trap    eq ta, 0xf           ; trap if a is a context

dynamic cost    4 cycles
static cost     4 words


Table A.49: Time cost of omitting GS Tag Trap Store.
(% of total cycles)

                        all cycles            store cost cycles
benchmark               ST        system      ST        system    both

testPopStoreInstVar











































A.5.2. Frequency of GS traps


One last interesting measurement is the cost of the Generation Scavenging trap. Table
A.52 gives the frequency of store traps. These data indicate that only 3.9% of the tagged
stores trap. Since the path length for the store trap handler is 40 cycles (including the code
to remember the object), the time spent handling these traps is

$$40\,\frac{\text{cycles}}{\text{trap}} \times 3.9\%\,\frac{\text{traps}}{\text{tagged store}} \times 0.36\%\,\frac{\text{tagged stores}}{\text{instruction}} \times \frac{1\ \text{instruction}}{1.5\ \text{cycles}} = 0.37\%$$

The time for store traps is insignificant.
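
The product of handler path length, trap rate, tagged store rate, and instructions per cycle can be checked numerically. The sketch below uses the figures quoted in this section; the function and parameter names are hypothetical.

```python
def store_trap_time_percent(handler_cycles=40,
                            traps_per_tagged_store=0.039,
                            tagged_stores_per_instruction=0.0036,
                            cycles_per_instruction=1.5):
    """Share of all cycles spent in the store trap handler, in percent."""
    handler_cycles_per_instruction = (handler_cycles
                                      * traps_per_tagged_store
                                      * tagged_stores_per_instruction)
    return 100.0 * handler_cycles_per_instruction / cycles_per_instruction


if __name__ == "__main__":
    print(f"time in store trap handler: {store_trap_time_percent():.2f}%")
```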

A.5.3. Evaluating the Pointer to Register Support

The pointer-to-register circuitry includes a comparator and a significant amount of
control complexity [Pen85b]. How well could SOAR get along without it? There are two cases
to analyze:

thisContext

In Smalltalk-80, a routine can request a pointer to its activation record by accessing the



Table A.52: Dynamic frequency of tagged store GS traps.
(Given as percentage of ST, system, both tagged stores executed.)

benchmark              ST        system    both
testPopStoreInstVar    0%        0%        0%
testClassOrganizer     0.30%     0%        0.24%
testCompiler           0.24%     4.83%     2.71%
testDecompiler         0%        0%        0%
testPrintDefinition    2.63%     0%        2.63%
testPrintHierarchy     21.05%    0%        13.79%
avg macros             4.84%     0.97%     3.87%

pseudo-variable thisContext. In this case, the compiler must give out an illegal
(unmapped) address. When the program tries to use this address, the page fault
handler can then ensure the activation record resides in memory and not on-chip, then
complete the operation. Fortunately, this case mostly occurs in the debugger, where a
speed penalty is more acceptable.

blockCopy

A Smalltalk-80 block permits execution of a piece of code in one procedure to be
controlled by another procedure. We implement this feature with a distinct activation
record that contains a pointer to the defining activation record. Thus, the code in a
block can access the data in its home activation record with loads and stores. If we
eliminate the pointer-to-register circuitry from SOAR, we merely need to flush a
block's home activation record out to memory when entering the block. This may
involve flushing extra register windows until we reach the desired one. On the other
hand, the desired window may already be in memory. We ran the benchmarks and
simulated the cost of this scheme. Every time control entered a block, we counted the
number of windows that would have to be flushed. The first column of Table A.53
gives the number of block invocations, and the second gives the average number of
windows flushed per invocation. We have assumed an 18 cycle cost to flush a
window: nine cycles to save it, and another nine to restore it. This estimate is probably
low since it omits the cost of handling the extra traps. The third column, which is the
cycles spent flushing windows per invocation, is just 18 times the second. The next
two columns give the frequency of block invocations per cycle in compiled Smalltalk
code, and the cost of simulating pointer-to-register per cycle in compiled Smalltalk
code. Finally, the last two columns give the same data, but relative to the total time,
not just the time executing compiled code. These data show that SOAR would be only
3% slower without the pointer-to-register feature.

Table A.53: Time cost of eliminating pointer-to-register hardware.

                   block     windows/   cycles/   values/     cost/       values/   cost/
benchmark          invoks    invok      invok     ST cycle    ST cycle    cycle     cycle
classOrganizer     4023      0.92       16.6      0.29%       4.89%       0.13%     2.08%
compiler           906       0.50       9.0       0.24%       2.20%       0.08%     0.75%
decompiler         2785      1.40       25.2      0.30%       7.49%       0.10%     2.43%
printDefinition    149       2.02       36.4      0.53%       19.2%       0.20%     7.31%
printHierarchy     152       1.30       23.4      0.50%       11.68%      0.13%     3.02%
average            1603      1.23       22.1      0.37%       9.09%       0.13%     3.12%
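
The derived columns of Table A.53 follow from the flushing model described above. The sketch below illustrates that model with the windows/invok and values/cycle columns of the table; the names are hypothetical, and because the published frequencies are rounded to two digits the recomputed cost only approximates the cost/cycle column.

```python
FLUSH_CYCLES_PER_WINDOW = 18   # nine cycles to save a window, nine to restore it

# benchmark: (windows flushed per invocation, block invocations per cycle in %)
ROWS = {
    "classOrganizer": (0.92, 0.13),
    "compiler": (0.50, 0.08),
    "decompiler": (1.40, 0.10),
    "printDefinition": (2.02, 0.20),
    "printHierarchy": (1.30, 0.13),
    "average": (1.23, 0.13),
}


def cycles_per_invocation(windows_per_invocation):
    """Cycles spent flushing register windows for one block invocation."""
    return FLUSH_CYCLES_PER_WINDOW * windows_per_invocation


def cost_per_cycle_percent(windows_per_invocation, invocations_per_cycle_percent):
    """Flushing time as a percentage of all cycles."""
    return invocations_per_cycle_percent * cycles_per_invocation(windows_per_invocation)


if __name__ == "__main__":
    for name, (windows, invocations) in ROWS.items():
        print(f"{name:16s} {cycles_per_invocation(windows):5.1f} cycles/invok  "
              f"{cost_per_cycle_percent(windows, invocations):5.2f}% of cycles")
```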


A.6. Implementation 

We have examined two implementation-related issues: eliminating register forwarding
and the relative proportions of data- and instruction-fetches.

A.6.1. Register Forwarding

How important is the register forwarding in SOAR's datapath? To get a crude idea, we
measured how often our simulated instructions used a forwarded value and assessed a
penalty of one cycle. Table A.54 shows the results of this measurement. Forwarding is

Table A.54: Time cost for eliminating forwarding.

                                     ST        system    both

testClassOrganizer
  cycles                             42.56%    57.44%    100%
  extra time for pipeline bubbles    9.72%     14.02%    12.19%

testCompiler
  cycles                             34.07%    65.93%    100%
  extra time for pipeline bubbles    10.26%    14.67%    13.17%

testDecompiler
  cycles                             32.38%    67.62%    100%
  extra time for pipeline bubbles    10.66%    16.88%    14.86%

testPrintDefinition
  cycles                             38.09%    61.91%    100%
  extra time for pipeline bubbles    9.81%     21.31%    16.93%

testPrintHierarchy
  cycles                             25.90%    74.10%    100%
  extra time for pipeline bubbles    10.39%    21.22%    18.41%

average of macro-benchmarks
  cycles                             34.60%    65.40%    100.00%
  extra time for pipeline bubbles    10.17%    17.62%    15.11%
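
The testCompiler row of Table A.54 can be recomputed from the raw counts in Table B.6. The sketch below assumes that the "forwarding cost" row of Table B.6 counts the uses of forwarded values and that each would cost one bubble cycle without forwarding, as described in the text; the names are hypothetical.

```python
# testCompiler counts as read from Table B.6.
CYCLES = {"ST": 370_941, "system": 717_817, "both": 1_088_758}
FORWARDED_USES = {"ST": 38_049, "system": 105_324, "both": 143_373}


def bubble_cost_percent(column, penalty_cycles=1):
    """Extra time for pipeline bubbles, relative to the measured cycles."""
    return 100.0 * FORWARDED_USES[column] * penalty_cycles / CYCLES[column]


if __name__ == "__main__":
    for column in ("ST", "system", "both"):
        print(f"{column:7s} {bubble_cost_percent(column):.2f}%")
```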
Table A.55: Instruction vs. Data Fetches, Part 1.

                                       ST        system    both

all instruction references             65.14%    34.86%    100%
all data references                    32.08%    67.92%    100%
all data + instruction references      61.15%    38.85%    100%
I-fetches per cycle                    90.73%    71.56%    82.98%
I-flushes per cycle                    3.15%     9.33%     5.65%
D-fetches per cycle                    6.12%     19.11%    11.37%

testActivationReturn

* Our simulator computed a value of -0.24% for this entry, clear evidence that our instruction c


































Table A.55: Instruction vs. Data Fetches, Part 2.

                                       ST        system    both

testDecompiler
  all instruction references           32.19%    67.81%    100%
  all data references                  33.27%    66.73%    100%
  all data + instruction references    32.42%    67.58%    100%
  I-fetches per cycle                  68.17%    68.76%    68.57%
  I-flushes per cycle                  12.57%    12.75%    12.69%
  D-fetches per cycle                  19.26%    18.50%    18.74%

testPrintDefinition
  all instruction references           38.01%    61.99%
  all data references                  36.82%    63.18%
  all data + instruction references    37.78%    62.22%
  I-fetches per cycle                  73.08%    73.33%
  I-flushes per cycle                  10.32%    9.14%
  D-fetches per cycle                  16.61%    17.53%

testPrintHierarchy
  all instruction references           26.25%    73.75%    100%
  all data references                  23.28%    76.72%    100%
  all data + instruction references    25.62%    74.38%    100%
  I-fetches per cycle                  71.39%    70.11%    70.44%
  I-flushes per cycle                  11.66%    10.36%    10.70%
  D-fetches per cycle                  16.95%    19.53%    18.86%

average of macro-benchmarks
  all instruction references           34.19%    65.81%    100.00%
  all data references                  33.57%    66.43%    100.00%
  all data + instruction references    34.06%    65.94%    100.00%
  I-fetches per cycle                  71.88%    72.40%    72.18%
  I-flushes per cycle                  9.86%     8.64%     9.07%
  D-fetches per cycle                  18.26%    18.96%    18.75%






























Appendix B 


Raw SOAR Data


B.1. Introduction

This appendix contains the raw data we gathered and used for the calculations in
Appendix A. The first section contains instruction mixes for the second iteration of several
benchmarks. These were run in an image that was modified to eliminate almost all
occurrences of the become primitive, as outlined in Chapter 5. The second section contains
execution time profiles for the same runs. To guide the reader through this section, we have
reprinted part of the table of contents in Table B.1.

B.2. Instruction Mix Data




condJumps: the number of times a jump immediately followed a skip. 


Table B.2: test3plus4 Micro-Benchmark Instruction Mix.


system both


Steps 4642

3332 2261 5393


































Table B.3: testPopStoreInstanceVariable Micro-Benchmark Instruction Mix.


ST system both


%extract
%insert


Table B.4: testActivationReturn Micro-Benchmark Instruction Mix.

              ST        system    both
Steps                             356067
Cycles        463922    19772     483694
WO            515       0         515
WU            513       2         515
Ccodes        3         0         3
%nop          1         515       516
%ret          0         1         1
%retn         1         515       516
%retnw        2         3         5
retnw         33280     5         33285
%reti         0         515       515
%retiw        0         515       515
%skip         0         1049      1049
skip          32767     0         32767
trap1         0         1         1
%trap2        16383     0         16383
%trap3        32769     1         32770
%store        0         20        20
%storem       0         515       515
%load         32771     534       33305
loadc         32769     1         32770
%loadm        0         515       515
%and          0         4         4
%or           0         3         3
%add          81926     2607      84533
%sub          0         1552      1552
sub           32766     0         32766
%extract      0         1         1
%insert       0         3         3
%jump         1033      523       1556
jump          16385     519       16904
%call         0         4         4
call          33285     4         33289



















































































































































Table B.5: testClassOrganizer Macro-Benchmark Instruction Mix.

              ST        system    both
WU retnw      4692      1856      6548
TI trap1      0         9         9
TI trap3      641       0         641
GS retnw      11        0         11
GS store      14        0         14
loadm 8       0         7037      7037
storem 5      0         11        11
storem 7      5320      435       5755
storem 8      0         7037      7037
ret*w's       75521     55960     131481
nonNil8-14    192227    224667    416894
int8-14       68182     131169    199351


                        ST        system    both
always %skip            3404      3
lt %skip                0         8671
lt skip                 276       5123
ge %skip                0         8526
ge skip                 14        144
ge trap1                0         262
eq %skip                7179      36799
eq skip                 6013      442
eq %trap1               0         190
eq trap1                0         1461
eq %trap4               1318      12
ne %skip                571       58466
ne %trap1               0         7612
ne trap1                0         3982
ne %trap3               57631     653
le %skip                0         17063
le skip                 7952      242
gt %skip                0         101
gt skip                 459       5786
gt %trap1               0         136
gt trap1                0         131
ltu/in0 %skip           0         747
geu/out0 %skip          0         3928
geu/out0 %trap1         0         4528
geu/out0 trap1          0         25536
geu/out0 %trap2         13507     412
leu %skip               0         1168
gtu %skip               0         791
untaggedImm %ret        0         4623      4623
untaggedImm %retw       0         39        39
untaggedImm retw        0         522       522
untaggedImm %retn       2051      8599      10650
untaggedImm retn        0         47        47
untaggedImm %retnw      6180      13938     20118
untaggedImm retnw       16217     83657     99874
untaggedImm %reti       0         22728     22728















Table B.5: testClassOrganizer Macro-Benchmark Instruction Mix.


untaggedImm %retiw
untaggedImm %retinw
untaggedImm %skip
untaggedImm skip
untaggedImm %trap1
untaggedImm %load
untaggedImm load
untaggedImm loadc
untaggedImm %loadm
untaggedImm %xor
untaggedImm %and
untaggedImm %or
untaggedImm %add
untaggedImm add
untaggedImm %sub
untaggedImm sub
untaggedImm %extract
untaggedImm %insert
taggedImm %skip
taggedImm %trap1
taggedImm %trap2
taggedImm %trap4
taggedImm %load
taggedImm %and
taggedImm %or
taggedImm %add


srl barrel shifter savings
forwarding cost
two-tone savings


111047

209716


Table B.6: testCompiler Macro-Benchmark Instruction Mix.

              ST        system    both
Steps                             743753
Cycles        370941    717817    1088758


1557 


2 75 

947 

1 168 

179 

3 1 

1 

1 13688 

13689 


2378 


960 

3 320 

320 

2 4410 

5622 

3 81 

81 





































Table B.6: testCompiler Macro-Benchmark Instruction Mix.

              ST        system    both
%retnw        2422      7362      9784
retnw         8528      23221     31749
%reti         0         7170      7170
%retiw        0         2729      2729
%retinw       0         75        75
%skip         3737      77074     80811
skip          4810      4342      9152
%trap1        0         2763      2763
trap1         0         7701      7701
%trap2        4735      259       4994
%trap3        18122     878       19000
%trap4        450       19        469
%store        1880      16253     18133
store         2973      3476      6449
%storem       1876      3236      5112
%load         36937     65008     101945
load          0         5087      5087
loadc         18121     1235      19356
%loadm        0         3316      3316
%srl          0         9388      9388
%sra          0         50        50
sra           0         24        24
%xor          0         1304      1304
%and          11        13067     13078
and           30        4         34
%or           451       4818      5269
%add          66485     106045    172530
add           3094      4385      7479
%sll          0         5159      5159
sll           0         1406      1406
%sub          450       20291     20741
sub           1124      5830      6954
%extract      0         12979     12979
%insert       0         3697      3697
%jump         26002     24280     50282
jump          9420      20049     29469
%call         0         6801      6801
call          34157     2548      36705
TT skip       579       1         580
TT loadc      2793      17        2810
WOO?          2088      641       2729
WU retw       0         146       146
WU retnw      1889      694       2583
TI trap1      0         75        75
TI trap3      872       0         872
GS retnw      4         0         4
GS store      7         168       175
loadm 7























































































Table B.6: testCompiler Macro-Benchmark Instruction Mix.

                              ST        system    both
untaggedImm %retinw           0         75        75
untaggedImm %skip             0         9993      9993
untaggedImm skip              1658      485       2143
untaggedImm %trap1            0         74        74
untaggedImm %load             33942     47159     81101
untaggedImm load              0         5087      5087
untaggedImm loadc             18121     1235      19356
untaggedImm %loadm            0         3316      3316
untaggedImm %xor              0         447       447
untaggedImm %and              0         6924      6924
untaggedImm and               17        4         21
untaggedImm %or               1         147       148
untaggedImm %add              8059      70090     78149
untaggedImm add               2189      120       2309
untaggedImm %sub              450       17456     17906
untaggedImm sub               542       4359      4901
untaggedImm %extract          0         10833     10833
untaggedImm %insert           0         2423      2423
taggedImm %skip               1058      17170     18228
taggedImm %trap1              0         952       952
taggedImm %trap2              4735      259       4994
taggedImm %trap4              450       19        469
taggedImm %load               2995      11425     14420
taggedImm %and                11        2306      2317
taggedImm %or                 450       1995      2445
taggedImm %add                3662      8732      12394
sll barrel shifter savings    0         3         3
srl barrel shifter savings    0         1900      1900
sra barrel shifter savings    0         24        24
forwarding cost               38049     105324    143373
two-tone savings              68706     91028     159734
condJumps                     3416      28601     32017


Table B.7: testDecompiler Macro-Benchmark Instruction Mix.

              ST        system     both
Steps                              1983995
Cycles        936933    1956663    2893596
Ccodes        6016      0          6016
TT            8641      6          8647
WO            3225      1548       4773
WU            3433      1340       4773
TI            3217      6          3223












































Table B.7: testDecompiler Macro-Benchmark Instruction Mix.

                          ST        system    both
%retnw                    3975      15526     19501
retnw                     21194     63099     84293
%reti                     0         16637     16637
%retiw                    0         4773      4773
%retinw                   0         6         6
%skip                     4601      236206    240807
skip                      15999     8356      24355
%trap1                    0         8682      8682
trap1                     0         21010     21010
%trap2                    12417     788       13205
%trap3                    45968     3212      49180
%trap4                    1088      82        1170
%store                    7926      50609     58535
store                     5375      4826      10201
%storem                   4680      6919      11599
%load                     88555     196228    284783
load                      0         14998     14998
loadc                     45962     3836      49798
%loadm                    0         4773      4773
%srl                      0         17159     17159
STS                       0         2120      2120
%xor                      0         6329      6329
%and                      31        36239     36270
and                       500       0         500
%or                       1088      10335     11423
%add                      186538    309908    496446
add                       11775     13398     25173
%sll                      0         7956      7956
sll                       0         1306      1306
%sub                      1088      46940     48028
sub                       2890      15654     18544
%extract                  0         37296     37296
%insert                   0         15013     15013
%jump                     60263     64066     124329
jump                      22167     52476     74643
%call                     0         17876     17876
call                      84494     7471      91965
TT skip                   798       0         798
TT loadc                  7843      6         7849
WOO?                      3225      1548      4773
WU retw                   0         1         1
WU retnw                  3433      1339      4772
TI trap1                  0         6         6
TI trap3                  3217      0         3217
loadm 8                   0         4773      4773
storem 4                  29        0         29
storem 6                  29        0         29
storem 7                  4622      2146      6768
storem 8                  0         4773      4773
ret*w's                   55944     48679     104623
nonNil8-14                155632    215034    370666
int8-14                   56499     125311    181810
always %skip              1726      61        1787
lt %skip                  0         44547     44547
lt skip                   797       1852      2649
ge %skip                  0         4500      4500
ge skip                   1095      135       1230
ge trap1                  0         168       168
eq %skip                  2683      40869     43552
eq skip                   5332      734       6066
eq %trap1                 0         1022      1022
eq trap1                  0         921       921
eq %trap4                 1088      82        1170
ne %skip                  192       123615    123807
ne skip                   0         88        88
ne %trap1                 0         5499      5499
ne trap1                  0         2875      2875
ne %trap3                 45968     3212      49180
le %skip                  0         10193     10193
le skip                   6112      142       6254
gt skip                   1865      5405      7270
gt %trap1                 0         115       115
gt trap1                  0         84        84
ltu/in0 %skip             0         2961      2961
geu/out0 %skip            0         1015      1015
geu/out0 %trap1           0         2046      2046
geu/out0 trap1            0         16384     16384
geu/out0 %trap2           12417     788       13205
leu %skip                 0         5107      5107
gtu %skip                 0         3338      3338
out1 trap1                0         578       578
untaggedImm %ret          0         6092      6092
untaggedImm retw          0         24        24
untaggedImm %retn         4049      8783      12832
untaggedImm retn          0         534       534
untaggedImm %retnw        3521      15496     19017
untaggedImm retnw         18215     61790     80005
untaggedImm %reti         0         16637     16637
untaggedImm %retiw        0         4773      4773
untaggedImm %retinw       0         6         6
untaggedImm %skip         0         28251     28251
untaggedImm skip          6564      0         6564
untaggedImm %trap1        0         379       379
untaggedImm %load         82930     153764    236694
untaggedImm load          0         14998     14998
untaggedImm loadc         45962     3836      49798









Table B.7: testDecompiler Macro-Benchmark Instruction Mix.


both


untaggedImm %loadm
untaggedImm %xor
untaggedImm %and
untaggedImm and
untaggedImm %or
untaggedImm %add
untaggedImm add
untaggedImm %sub
untaggedImm sub
untaggedImm %extract
untaggedImm %insert
taggedImm %skip
taggedImm %trap1
taggedImm %trap2
taggedImm %trap4
taggedImm %load
taggedImm %and
taggedImm %or
taggedImm %add


srl barrel shifter savings
forwarding cost
two-tone savings
condJumps


Table B.8: testPrintDefinition Macro-Benchmark Instruction Mix.


system both 


H 


28249 































Table B.8: testPrintDefinition Macro-Benchmark Instruction Mix.

              ST        system    both
%store        214
store         38        0         38
%storem       0         47        47
%load         2857      5571      8428
load          0         866       866
loadc         1648      38        1686
%loadm        0         20        20
%srl          0         868       868
%xor          0         621       621
%and          0         1238      1238
and           1         0         1
%or           0         292       292
%add          6277      6745      13022
add           469       460       929
%sll          0         199       199
%sub          0         726       726
sub           14        908       922
%extract      0         2031      2031
%insert       0         750       750

%jump
jump
%call
call
TT skip
TT loadc
WOO?
WU retnw
TI trap3
GS retnw
GS store
loadm 8
storem 5
storem 7
storem 8
ret*w's
nonNil8-14
int8-14
always %skip


eq %mpl 


1 

0 

0 

58 

0 

4 

0 

4 

3 

1379 

7 

0 






























Table B.8: testPrintDefinition Macro-Benchmark Instruction Mix.

                              ST        system    both
ne trap1                      0         149       149
ne %trap3                     1648      14        1662
le %skip                      0         227       227
le skip                       417       0         417
gt %skip                      0         1         1
gt skip                       258       0         258
gt %trap1                     0         2         2
gt trap1                      0         2         2
ltu/in0 %skip                 0         39        39
geu/out0 %skip                0         199       199
geu/out0 %trap1               0         202       202
geu/out0 trap1                0         1083      1083
geu/out0 %trap2               282       2         284
leu %skip                     0         64        64
gtu %skip                     0         42        42
untaggedImm %ret              0         161       161
untaggedImm %retw             0         3         3
untaggedImm %retn             38        43        81
untaggedImm %retnw            160       324       484
untaggedImm retnw             360       2482      2842
untaggedImm %reti             0         250       250
untaggedImm %retiw            0         20        20
untaggedImm %skip             0         992       992
untaggedImm skip              26        0         26
untaggedImm %trap1            0         4         4
untaggedImm %load             2725      3667      6392
untaggedImm load              0         866       866
untaggedImm loadc             1648      38        1686
untaggedImm %loadm            0         20        20
untaggedImm %xor              0         217       217
untaggedImm %and              0         910       910
untaggedImm %add              329       4601      4930
untaggedImm add               469       0         469
untaggedImm %sub              0         693       693
untaggedImm sub               8         679       687
untaggedImm %extract          0         1423      1423
untaggedImm %insert           0         114       114
taggedImm %skip               7         531       538
taggedImm %trap1              0         202       202
taggedImm %trap2              282       2         284
taggedImm %load               132       972       1104
taggedImm %and                0         55        55
taggedImm %or                 0         53        53
taggedImm %add                412       193       605
srl barrel shifter savings    0         434       434
forwarding cost               2770      9784      12554
two-tone savings              5800      9344      15144
condJumps                     299       1281      1580






















Table B.9: testPrintHierarchy Macro-Benchmark Instruction Mix.

              ST        system    both
Steps                             82833
Cycles        30458     87127     117585
Ccodes        193       0         193
TT            86        0         86
WO            117       26        143
WU            81        62        143
TI            85        3         88
GS            24        0         24
%nop          1         169       170
%ret          0         249       249
%retw         0         48        48
%retn         109       256       365
retn          0         8         8
%retnw        208       396       604
retnw         618       2329      2947
%reti         0         338       338
%retiw        0         143       143
%retinw       0         3         3
%skip         261       8996      9257
skip          545       35        580
%trap1        0         1148      1148
trap1         0         1324      1324
%trap2        303       6         309
%trap3        1657      98        1755
%trap4        13        0         13
%store        176       2595      2771
store         57        30        87
%storem       52        265       317
%load         2908      10094     13002
load          0         890       890
loadc         1657      117       1774
%loadm        0         167       167
%srl          0         1308      1308
%xor          0         1782      1782
%and          0         1955      1955
and           4         0         4
%or           13        643       656
%add          6770      12942     19712
add           452       155       607
%sll          0         22        22
sll           0         3         3
%sub          13        2218      2231
sub           50        513       563
%extract      0         2567      2567
%insert       0         1734      1734
%jump         1870      2509      4379
jump          828       2284      3112
%call         0         547       547






























Table B.9: testPrintHierarchy Macro-Benchmark Instruction Mix.

                                  ST  system    both
call                            2986     202    3188
TT skip                           12       0      12
TT loadc                          74       0      74
WO call                          117      26     143
WU retnw                          81      62     143
T1 trap1                           0       3       3
T1 trap3                          85       0      85
GS retnw                          12       0      12
GS store                          12       0      12
loadm 7                            0      24      24
loadm 8                            0     143     143
storem 5                           0      12      12
storem 7                          52     110     162
storem 8                           0     143     143
ret*w's                         1888    1702
nonNil8-14                      5682    8475   14157
int8-14                         1344    4622    5966
always %skip                      45       0      45
lt %skip                           0    1362    1362
lt skip                            7      11      18
ge %skip                           0       9       9
ge skip                            5       4       9
ge trap1                           0       8       8
eq %skip                         216    1552    1768
eq skip                           51       4      55
eq %trap1                          0      24      24
eq trap1                           0       4       4
eq %trap4                         13       0      13
ne %skip                           0    5543    5543
ne skip                            0       2       2
ne %trap1                          0     750     750
ne trap1                           0     152     152
ne %trap3                       1657      98    1755
le %skip                           0     108     108
le skip                          377       0     377
gt %skip                           0      12      12
gt skip                           93      14     107
gt %trap1                          0       4       4
gt trap1                           0       4       4
ltu/in0 %skip                      0     108     108
geu/out0 %skip                     0      12      12
geu/out0 %trap1                    0     370     370
geu/out0 trap1                     0    1154    1154
geu/out0 %trap2                  303       6     309
leu %skip                          0     182     182
gtu %skip                          0     108     108
out1 trap1                         0       2       2
untaggedImm %ret                   0     237     237
untaggedImm %retw                  0      36      36
untaggedImm %retn                109     256     365
untaggedImm retn                   0       8       8
untaggedImm %retnw               200     390     590
untaggedImm retnw                545    2273    2818
untaggedImm %reti                  0     338     338
untaggedImm %retiw                 0     143     143
untaggedImm %retinw                0       3       3
untaggedImm %skip                  0    1983    1983
untaggedImm skip                  81       0      81
untaggedImm %trap1                 0       8       8
untaggedImm %load               2504    7300    9804
untaggedImm load                   0     890     890
untaggedImm loadc               1657     117    1774
untaggedImm %loadm                 0     167     167
untaggedImm %xor                   0     370     370
untaggedImm %and                   0     887     887
untaggedImm %add                 728    9243    9971
untaggedImm add                  441       1     442
untaggedImm %sub                  13    2067    2080
untaggedImm sub                   23     417     440
untaggedImm %extract               0     891     891
untaggedImm %insert                0     288     288
taggedImm %skip                   35    1752    1787
taggedImm %trap1                   0     370     370
taggedImm %trap2                 303       6     309
taggedImm %trap4                  13       0      13
taggedImm %load                  404    1882    2286
taggedImm %and                     0     241     241
taggedImm %or                     13     195     208
taggedImm %add                    37     669     706
srl barrel shifter savings         0     647     647
forwarding cost                 3166   18485   21651
two-tone savings                6621    8649   15270
condJumps                        363    2385    2748


B.3. Execution Profile Data

The data in this section were derived by modifying the simulator to sample its PC
every 100 cycles, and using an awk [AKW] program to merge the samples with the assembler's
symbol table. Instrumenting the simulator instead of the SOAR program enables us to
profile the program without altering its behavior. All times listed in this appendix are given
as a percentage of the total time. For an explanation of the primitive numbers, see the
Smalltalk-80 book by Goldberg and Robson [GoR83]. The more obscure labels can only be
understood by reading our code.
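
The merge step described above can be illustrated with a short sketch. It is written in Python rather than awk, and it is not the program used for this appendix: the function name, the symbol addresses, and the sample values are all hypothetical, and it assumes each sampled PC belongs to the label with the highest start address at or below it.

```python
from bisect import bisect_right
from collections import Counter

def profile(samples, symbols):
    """Attribute each sampled PC to a label and return percentages of total time.

    samples: PC values, one per sampling interval (e.g. every 100 cycles).
    symbols: (start_address, label) pairs from a symbol table.
    Assumption: a PC belongs to the label with the largest start address
    that is less than or equal to it.
    """
    if not samples:
        return {}
    table = sorted(symbols)
    starts = [addr for addr, _ in table]
    counts = Counter()
    for pc in samples:
        i = bisect_right(starts, pc) - 1   # last label starting at or before pc
        label = table[i][1] if i >= 0 else "other"
        counts[label] += 1
    total = len(samples)
    return {label: 100.0 * n / total for label, n in counts.items()}

# Hypothetical symbol table and samples, for illustration only.
symbols = [(0x1000, "lookup"), (0x1400, "blockCopy"), (0x2000, "FailPrm")]
samples = [0x1004, 0x1010, 0x1404, 0x2008]
percentages = profile(samples, symbols)
```

Because every sample stands for the same number of cycles, the share of samples that falls under a label estimates that label's share of total execution time, which is the quantity the tables in this section report.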





Table B.12: testClassOrganizer Macro-Benchmark Execution Time Profile.


Smalltalk
WindowOverflowTrapH
SIQuoPrm
StringAtPrm
Prim_60
WSNextPutPrm
WindowUnderflowTrapH
SIMulPrm
StringReplaceFromToWithStartingPrm
lookupMethodInClass
Prim_62
SkipTagTrapH
RSNextPrm
SISISIPrm
LoadcTagTrapH
BehavNew
BCValuePrm2
SYS_word_fill
SkipOnTrue
SILTPrm
SkipTagTrapS
Prim_61
Prim_110
SkipTagTrapH!done
blockCopy
lookup
Prim_81
PSAtEndPrm
blockArrowReturn
FailPrm
Table B.13: testCompiler Macro-Benchmark Execution Time Profile.


0.3% Primjl
0.3% Prim_71
03% Prim_70
03% Primj 10
0.2% methodBlockCopy
0.2% insert!04!sel!here
03% getWordSim
03% eqNewNewBecome
03% argumentCount
03% SkipTagTrapH!done
03% SkipOnFalse
03% Prim_111
03% PSAtEndPrm
0.1% insert!03!sel!here
0.1% gsSurvivors
0.1% gsStoreGSTrapS
0.1% gsRemembered
0.1% SVTrace
0.1% Prim_83
0.1% Prim_75
0.1% FailPrm


Table B.14: testDecompiler Macro-Benchmark Execution Time Profile.



Smalltalk 

213% 

lookupMethodInClass

7% 

BehavNew 

3.8% 

Prim_60 

3.7% 

WindowOverflowTrapH

3.7% 

WSNextPutPrm 

33% 

SYS_word_fill 

3.1% 

Prim_61

2.7% 

WindowUnderflowTrapH 

2.1% 

SIQuoPrm 

2% 

lookup 

13% 

StringReplaceFromToWithStartingPrm

13% 

StringAtPrm 

1.3% 

Prim_62 

1.1% 

allocSpace 

1.1% 

SlSlSlPrm 

1.1% 

BCValuePrm2 

1% 

blockCopy 

0.9% 

SIMulPrm

0.8% 

LoadcTagTrapH 

03% 

cacheMissLookup 

03% 

SkipOnTrue

03% 

Prim_71 

03% 

Prim_70 

0.4% 

TryRight

0.3% 

other 

03% 

SkipTagTrapH 




































































































